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ABSTRACT 

High-throughput microarray technology is an important and revolutionary technique 
used in genomics and systems biology to analyze the expression of thousands of genes 
simultaneously. The popular use of this technique has resulted in enormous repositories 
of microarray data, for example, the Gene Expression Omnibus (GEO), maintained by 
the National Center for Biotechnology Information (NCBI). However, an effective 
approach to optimally exploit these datasets in support of specific biological studies is 
still lacking. Specifically, an improved method is required to integrate data from 
multiple sources and to select only those datasets that meet an investigator's interest. In 
addition, to exploit the full power of microarray data, an effective method is required to 
determine the relationships among genes in the selected datasets and to interpret the 
biological meanings behind these relationships. 

To address these requirements, we have developed a machine learning based approach 
that includes: 

• An effective meta-analysis method to integrate microarray data from multiple 
sources; the method exploits information regarding the biological context of 
interest provided by the biologists. 



in 



• A novel and effective cluster analysis method to identify hidden patterns in 
selected data representing relationships between genes under the biological 
conditions of interest. 

• A novel motif finding method that discovers, not only the common transcription 
factor binding sites of co-regulated genes, but also the miRNA binding sites 
associated with the biological conditions. 

• A machine learning-based framework for microarray data analysis with a web 
application to run common analysis tasks on online. 

The form and content of this abstract are approved. I recommend its publication. 

Approved: Tom Altman 
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1. Introduction 

High-throughput microarray technology for determining global changes in gene 
expression is an important and revolutionary experimental paradigm that facilitates 
advances in functional genomics and systems biology. This technology allows 
measurement of the expression levels of thousands of genes with a single microarray. 
Widespread use of the technology is evident in the rapid growth of microarray datasets 
stored in public repositories. For example, since its inception in the early 1990s, the 
Gene Expression Omnibus (GEO), maintained by the National Center for 
Biotechnology Information (NCBI), has received thousands of data submissions 
representing more than 3 billion individual molecular abundance measurements [6,7]. 

1.1 Gene expression data analysis challenges 

The growth in microarray data deposition is reminiscent of the early days of 
GenBank, when exponential increases in publicly accessible biological sequence data 
drove the development of analytical techniques required for data analysis. However, 
unlike biological sequences, microarray datasets are not easily shared by the research 
community, resulting in many investigators being unable to exploit the full potential of 
these data. In spite of over 15 years of experience with microarray experiments, new 
paradigms for integrating and mining publicly available microarray results are still 
needed to promote widespread, investigator-driven research on shared data. In general, 
integrating multiple datasets is expected to yield more reliable and more valid results 
due to a larger number of samples and a reduction of study-specific biases. 



Ideally, gene expression data obtained in any laboratory, at any time, using any 
microarray technology should be comparable. However, this is not true in reality. 
Several issues arise when attempting to integrate microarray data generated using 
different array technologies. Meta-analysis and integrative data analysis only exploits 
the behavior of genes individually at the genotype level. In order to discover gene 
behaviors at the phenotype level by connecting genotype to phenotype, additional 
analyses are required. Cluster analysis, an unsupervised learning method, is a popular 
method that searches for patterns of gene expression associated with an experimental 
condition. Each experimental condition usually corresponds to a specific biological 
event, such as a drug treatment or a disease state, thereby allowing discovery of the 
drug targets, and drug and disease relationships. 

Numerous algorithms have been developed to address the problem of data 
clustering. These algorithms, notwithstanding their different approaches deal with the 
issue of determining the number and the construction of clusters. Although some cluster 
indices address this problem, they still have the drawback of model over-fitting. 
Alternative approaches, based on statistics with the log-likelihood estimator and a 
model parameter penalty mechanism, can eliminate over- fitting, but are still limited by 
assumptions regarding models of data distribution and by a slow convergence with 
model parameter estimation. Even when the number of clusters is known a priori, 
different clustering algorithms may provide different solutions, because of their 
dependence on the initialization parameters. Because most algorithms use an iterative 



process to estimate the model parameters while searching for optimal solutions, a 
solution that is a global optimum is not guaranteed [122,132]. 

1.2 Motivations and proposed methods 

This dissertation focuses on construction of a machine learning and data mining 
framework for discovery of biological meaning in gene expression data collected from 
multiple sources. Three tools have been developed: (i) a novel meta-analysis method for 
identifying expression profiles of interest; (ii) set of novel clustering methods to 
incorporate prior biological knowledge and to predict co-regulated genes; and (iii) a 
novel motif-finding method to extract information on shared regulatory sequences, 
Together these methods provide comprehensive analysis of gene expression data. 

1.2.1 Microarray meta-analysis 

Several methods have been proposed to combine inter-study microarray data. These 

methods fall into two major categories based on the level at which data integration is 
performed: meta-analysis and direct integration of expression values. Meta-analysis 
integrates data from different studies after they have been pre-analyzed individually. In 
contrast, the latter approach integrates microarray data from different studies at the level 
of expression after transforming expression values to numerically comparable measures 
and/or normalization and carries out the analysis on the combined dataset. 

Meta-analysis methods combine the results of analysis from individual studies in 
order to increase the ability to identify significantly expressed genes or altered samples 
across studies that used different microarray technologies (i.e., different platforms or 
different generations of the same platform). Lamb et al. [25] and Zhang et al. [26] 



proposed rank-based methods for performing meta-analysis. They started within the 
Connectivity Map (CMAP) datasets, a collection of one-off microarray screenings of 
>1000 small-molecule treatments of five different human cell lines. Genes in each array 
were ranked using statistical methods and the array data were then integrated, based on 
how they fit a given gene expression signature, defined as a set of genes responding to 
the biological context of interest. Gene expression signatures could then be used to 
query array data generated using other array technologies. These methods successfully 
identified a number of small molecules with rational and informative associations with 
known biological processes. However, Lamb et al. [25] weighted up-regulated genes 
and down-regulated genes differently, while Zhang et al. [26] showed that genes with 
the same amount of expression change should be weighted equally. Both Zhang et al. 
[26] and Lamb et al. [25] used fold change criteria (where "fold change", FC, is the 
ratio between the gene expression value of treated and untreated conditions) to rank the 
genes in expression profiles and they solved the problem of platform-specific probes by 
eliminating them. Because of the experimental design and the noise inherent in the 
screening process and imaging methods, a single gene may have probesets that differ in 
the direction of expression change, e.g., some probesets may be up-regulated and others 
down-regulated. This problem was addressed by Lamb et al. [25] and Zhang et al. [26] 
by including all probesets in the ranked list and averaging probesets across genes. In 
fact, it is not biologically necessary that a single differentially expressed gene show the 
identical FC for all probesets. It is not straightforward to compare differentially 
expressed genes just by averaging their probeset values. Witten et al. [9] showed that 



FC-based approaches, namely FC-difference and FC-ratio, are linear-comparison 
methods and are not appropriate for noisy data and the data from different experiments. 
In addition, these methods are unable to measure the statistical significance of the 
results. While approaches using the t-test were proposed, Deng [10] showed that the 
Rank-Product (RP) method is not only more stable than the t-test, but it also provides 
the statistical significance of the gene expression differences. Breitling et al. [29] 
proposed an RP -based approach that allows the analysis of gene expression profiles 
from different experiments and offers several advantages over the linear approaches, 
including the biologically intuitive FC criterion, using fewer model assumptions. 
However, both the RP and t-test methods, as well as other popular meta-analysis 
methods, have problems with small datasets and datasets with a small number of 
replicates. Hong et al. [30] proposed an RP-based method that is most appropriate, not 
only for small sample sizes, but also for increasing the performance on noisy data. Their 
method has achieved widespread acceptance and has been used in such diverse fields as 
RNAi analysis, proteomics, and machine learning [8]. However, most RP-based 
methods, unlike t-test based methods, do not provide an overall measure of differential 
expression for every gene within a given study [54]. 

We develop a meta-analysis method using RP first to determine the differentially 
expressed genes in each study and then to construct ordered lists of up and down 
regulated genes in each study based on the p-values determined using a permutation- 
based method. We then use the non-parametric rank-based pattern-matching strategy 
based on the Kolmogorov-Smirnov statistic [25] to filter the profiles of interest. We 



again use RP to filter genes which are differentially expressed across multiple studies. 
In our method, a t-test based approach can also be used in the first step to generate an 
estimated FC for each gene within each study. These values are then used in the cluster 
analysis, instead of the average FC values, to group the genes based on their expression 
change pattern similarity across multiple studies. 

1.2.2 Cluster analysis of gene expression data 

Numerous clustering algorithms are currently in use. Hierarchical clustering results 

in a tree structure, where genes on the same branch at the desired level are considered to 
be in the same cluster. While this structure provides a rich visualization of the gene 
clusters, it has the limitation that a gene can be assigned only to one branch, i.e., one 
cluster. This is not always biologically reasonable because most, if not all, genes have 
multiple functions. In addition, because of the mechanism of one-way assignment of 
genes to branches, the results may not be globally optimal. The alternative approach, 
partitioning clustering, includes two major methods, heuristic-based and model-based. 
The former assigns objects to clusters using a heuristic mechanism, while the latter uses 
quantifying uncertainty measures. Probability and possibility are the two uncertainty 
measures commonly used. While probabilistic bodies of evidence consist of singletons, 
possibilistic bodies of evidence are families of nested sets. Both probability and 
possibility measures are uniquely represented by distribution functions, but their 
normalization requirements are very different. Values of each probability distribution 
are required to add to 1, while for possibility distributions, the largest values are 
required to be 1. Moreover, the latter requirement may even be abandoned when 



possibility theory is formulated in terms of fuzzy sets [172-173, 179]. The mixture 
model with the expectation-maximization (EM) algorithm is a well-known method 
using the probability-based approach. This method has the advantages of a strong 
statistical basis and a statistics-based model selection. However, the EM algorithm 
converges slowly, particularly at regions where clusters overlap [27] and requires the 
data distribution to follow some specific distribution model. Because gene expression 
data are likely to contain overlapping clusters and do not always follow standard data 
distributions, (e.g., Gaussian, Chi-squared) [28], the mixture model with the EM 
method is not appropriate. 

Fuzzy clustering using the most popular algorithm, Fuzzy C-Means (FCM) [92, 
121], is another model-based clustering approach that uses the possibility measure. 
FCM both converges rapidly and allows assignment of objects to overlapping clusters 
using the fuzzifier factor, m, where l<m<oc. When the value of m changes, the cluster 
overlaps will change accordingly and there will be no overlapping regions in the 
clustering results when the value of m is equal to 1 . However, similar to the EM-based 
method and the other partitioning methods, FCM has the problem of determining the 
correct number of clusters. Even if the cluster number is known a priori, FCM may 
provide different cluster partitions. Cluster validation methods are required to determine 
the optimal cluster solution. Unfortunately, the clustering model of FCM is a 
possibility-based one. There is no straightforward statistics-based approach to evaluate 
a clustering model except the cluster validity indices based on the compactness and 
separation factors of the fuzzy partitions. However, there are problems with these 



validity indices; both with scaling the two factors and with over-fit estimates. We have 
combined the advantages of the EM and FCM methods, where FCM plays the key role 
in clustering the data, and proposed a method to convert the clustering possibility model 
into a probability one and then to use the Central Limit Theorem to compute the 
clustering statistics and determine the data distribution model that best describes the 
dataset. We applied the Bayesian method with log-likelihood ratio and Akaike 
Information Criterion measure using the estimated distribution model for a novel 
validation method for fuzzy clustering partitions [147]. 

1.2.3 DNA sequence analysis 

Genes within a cluster are expressed similarly under the given experimental 

conditions, e.g., a drug treatment, a comparison between the normal and disease tissues, 
or other biological comparisons of interest. Genes that are similarly expressed may have 
regulatory DNA sequences in common and appropriate analysis may identify motifs 
that are overrepresented in these gene sequences. Such motifs may contribute to 
expression of the entire group [31]. In addition, regarding clusters of down-regulated 
genes, we can apply the sequence analysis to the 3' untranslated regions (3 f UTR) 
potentially discovering motifs connected to miRNAs [32]. We may discover the 
relationships between the miRNAs and the diseases or the treatment conditions targeted 
by the microarray experiments. 

Most algorithms developed to find motifs in biological sequences do not 
successfully identify motifs containing gaps and do not allow a variable number of 
motif instances in different sequences. In addition, they may converge to local optima 



[33] and therefore do not guarantee that the motifs are indeed overrepresented in the 
sequence set. The MEME algorithm [34] solved many of these problems, but still does 
not model gapped motifs well. By combining the EM algorithm with the hierarchical 
genetics algorithm, we developed a novel motif discovery algorithm, HIGEDA [145], 
that automatically determines the motif consensus using Position Weight Matrix 
(PWM). By using a dynamic programming (DP) algorithm, HIGEDA also identifies 
gapped motifs. 

1.3 Research contributions 

The rapid growth of microarray repositories has increased the need for effective 
methods for integration of datasets among laboratories using different microarray 
technologies. The meta-analysis approach has the advantage of integrating data without 
the problem of scaling expression values and also has multiple supporting methods to 
detect and order significantly differentially expressed genes. However, parametric 
statistics methods may perform differently on different datasets, because of different 
assumptions about the method model. We propose using the RP method [8] to filter the 
genes for differential expression. We also propose to integrate RP with a pattern 
matching ranking method, in a method named RPPR [169], a combination of the Rank 
Product and Rank Page methods, to more effectively filter gene expression profiles 
from multiple studies. 

Meta-analysis of gene expression data does not put differentially expressed genes 
together at the phenotype level of the experiments. Therefore, the real power of 
microarrays is not completely exploited. Cluster analysis can discover relationships 



among genes, between genes and biological conditions, and between genes and 
biological processes. Most clustering algorithms are not appropriate for gene expression 
data analysis because they do not allow overlapping clusters and require the data to 
follow a specific distribution model. We developed a new clustering algorithm, 
fzGASCE [157] that combines the well-known optimization algorithm, Genetic 
Algorithm (GA), with FCM and fuzzy subtractive clustering to effectively cluster gene 
expression data without a priori specification of the number of clusters. Regarding the 
parameter initialization problem of the FCM algorithm, we developed a novel Fuzzy 
Subtractive Clustering (fzSC) algorithm [146] that uses a fuzzy partition of data instead 
of the data themselves. fzSC has advantages over Subtractive Clustering in that it does 
not require a priori specification of mountain peak and mountain radii, or the criterion 
on how to determine the number of clusters. In addition, the computational time of fzSC 
is O(cn) instead of 0(n ), where c and n are the numbers of clusters and data objects 
respectively. To address the problem of missing data in cluster analysis, we developed 
two imputation methods, fzDBI and fzPBI, using histogram based and probability based 
approaches to model the data distributions and apply the model in imputation of missing 
data. For the problem of cluster model selection and clustering result evaluation with 
the FCM algorithm, we developed a novel method, fzBLE, [147] that uses the 
likelihood function, a statistics goodness-of-fit, and a Bayesian method with the Central 
Limit Theorem, to effectively describe the data distribution and correctly select the 
optimal fuzzy partition. Using the statistical model of fzBLE, we developed a 
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probability based method for defuzzification of fuzzy partition that helps with 
generating classification information of data objects using fuzzy partition. 

In addition to clustering of gene expression data, we also propose an analysis 
directed at the biological relevance of the cluster results. Using our RPPR method, we 
provide relationships between the gene clusters and the biological conditions of interest, 
and determine the clusters which respond positively and negatively to the experimental 
conditions. Furthermore, our motif- finding algorithm, HIGEDA [145], can discover 
common motifs in the promoter and mRNA sequences of genes in a cluster. In addition 
to predicting the possible transcription factor binding sites, HIGEDA can also be used 
to predict potential miRNA binding sites. 

Figure 1-1 shows the microarray data analysis key issues which are addressed in 
this research. The remainder of the dissertation is organized as follows: Chapter 2 
presents the background of microarray analysis including microarray technology, 
microarray data analysis and recent approaches, and the goals of this research; Chapter 
3 describes a rigorous analysis of the Fuzzy C-Means algorithm; Chapter 4 describes 
our methods to address the challenges of gene expression microarray data analysis; and 
Chapter 5 demonstrates some specific applications of our approach, concludes our 
contributions and discusses our future work. 
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Figure 1-1 : Microarray data analysis and the contributions of this research 
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2. Microarray analysis background 

2.1 Microarray 

DNA microarrays rely on the specificity of hybridization between complementary 
nucleic acid sequences in DNA fragments (termed probes) immobilized on a solid 
surface and labeled RNA fragments isolated from biological samples of interest [6]. A 
typical DNA microarray consists of thousands of ordered sets of DNA fragments on a 
glass, filter, or silicon wafer. After hybridization, the signal intensity of each individual 
probe should correlate with the abundance of the labeled RNA complementary to that 
probe [1]. 

2.1.1 Microarray platforms 

DNA microarrays fall into two types based on the DNA fragments used to build 

the array: complementary DNA (cDNA) arrays and oligonucleotide arrays. Although a 
number of subtypes exist for each array type, spotted cDNA arrays and Affymetrix 
oligonucleotide arrays are the major platforms currently in use. The choice of which 
microarray platform to use is based on the research needs, cost, available expertise, and 
accessibility [1]. 

For cDNA arrays, cDNA probes, which are usually generated by a polymerase 
chain reaction (PCR) amplification of cDNA clone inserts (representing genes of 
interest), are robotically spotted on glass slides or filters. The immobilized sequences of 
cDNA probes may range greatly in length, but are usually much longer than those of the 
corresponding oligonucleotide probes. The major advantage of cDNA arrays is the 
flexibility in designing a custom array for specific purposes. Numerous genes can be 
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rapidly screened, which allows very quick elaboration of functional hypotheses without 
any a priori assumptions [120]. In addition, cDNA arrays typically cost approximately 
one-fourth as much as commercial oligonucleotide arrays. Flexibility and lower cost 
initially made cDNA arrays popular in academic research laboratories. However, the 
major disadvantage of these arrays is the amount of total input RNA needed. It is also 
difficult to have complete control over the design of the probe sequences. cDNA is 
generated by the enzyme reverse transcriptase RNA-dependent DNA polymerase, and 
like all DNA polymerases, it cannot initiate synthesis de novo, but requires a primer. It 
is therefore difficult to generate comprehensive coverage of all genes in a cell. 
Furthermore, managing large clone libraries, and the infrastructure of a relational 
database for keeping records, sequence verification and data extraction is a challenge 
for most laboratories. 

For oligonucleotide arrays, probes are comprised of short segments of DNA 
complementary to the RNA transcripts of interest and are synthesized directly on the 
surface of a silicon wafer. When compared with cDNA arrays, oligonucleotide arrays 
generally provide greater gene coverage, more consistency, and better quality control of 
the immobilized sequences. Other advantages include uniformity of probe length, the 
ability to discriminate gene splice variants, and the availability of carefully designed 
standard operating procedures. Another advantage particular to Affymetrix arrays is the 
ability to recover samples after hybridization to an array. This feature makes Affymetrix 
arrays attractive in situations where the amount of available tissue is limited. However, 
a major disadvantage is the high cost of arrays [1]. 



14 



Following hybridization, the image is processed to obtain the hybridization 
signals. There are two different ways to measure signal intensity. In the two-color 
fluorescence hybridization scheme, the RNA from experimental and control samples 
(referred to as target RNAs) are differentially labeled with fluorescent dyes (Cye5 - red 
vs. Cye3 - green) and hybridized to the same array. When the region of the probe is 
fluorescently illuminated, both the experimental and control target RNAs fluorescence 
and the relative balance of red versus green fluorescence indicate the relative expression 
levels of experimental and control target RNAs. Therefore, gene expression values are 
reported as ratios between the two fluorescent values [1]. 
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Figure 2-1 : Microarray design and screening 



15 



Affymetrix oligonucleotide arrays use a one-color fluorescence hybridization 
system where experimental RNA is labeled with a single fluorescent dye and hybridized 
to an oligonucleotide array. After hybridization, the fluorescence intensity from each 
spot on the array provides a measurement of the abundance of the corresponding target 
RNA. A second array is then hybridized to the control RNA, allowing calculation of 
expression differences. Because Affymetrix array screening generally follows a 
standard protocol, results from different experiments in different laboratories can 
theoretically be combined [6]. 

Following image processing, the digitized gene expression data need to be pre- 
processed for data normalization before carrying out further analysis. 

2.1.2 Gene expression microarray data analysis 

Regarding differentially expressed genes, many protocols use a cutoff of a 

twofold difference as a criterion. However, this arbitrary cutoff value may be either too 
high or too low depending on the data variability. In addition, inherent data variability is 
not taken into account. A data point above or below the cutoff line could be there by 
chance or by error. To ensure that a gene is truly differentially expressed requires 
multiple replicate experiments and statistical testing [8]. However, not all microarray 
experiments are array-replicated. We, therefore, need to analyze data generated by 
different protocols. 

Statistical analysis, including meta-analysis, uses microarrays to study genes in 
isolation while the real power of microarrays is their ability to study the relationships 
between genes and to identify genes or samples that behave in a similar manner. 
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Machine learning and data mining approaches have been developed for further analysis 
procedures. These approaches can be divided into unsupervised and supervised 
methods. Unsupervised methods involve the aggregation of samples, genes, or both into 
different clusters based on the distance between measured gene expression values. The 
goal of clustering is to group objects with similar properties, leading to clusters where 
the distance measure is small within clusters and large between clusters. Several 
clustering methods from classical pattern recognition, such as hierarchical clustering, K- 
Means clustering, fuzzy C-Means clustering, and self-organizing maps, have been 
applied to microarray data analysis. Using unsupervised methods, we can search the 
resulting clusters for candidate genes or treatments whose expression patterns could be 
associated with a given biological condition or gene expression signature. The 
advantage of this method is that it is unbiased and allows for identification of significant 
structure in a complex dataset without any prior information about the objects. 

In contrast, supervised methods integrate the knowledge of sample class 
information into the analysis with the goal of identifying expression patterns (i.e., gene 
expression signatures) which could be used to classify unknown samples according to 
their biological characteristics. A training dataset, consisting of gene expression values 
and sample class labels, is used to select a subset of expressed genes that have the most 
discriminative power between the classes. It is then used to build a predictive model, 
also called a classifier (e.g., k-nearest neighbors, neural network, support vector 
machines), which takes gene expression values of the pre-selected set of genes of an 
unknown sample as input and outputs the predicted class label of the sample. 
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Figure 2-2 : Microarray analysis 
2.2 Microarray data meta-analysis 

2.2.1 Microarray data integration 

The integrated analysis of data from multiple studies generally promises to 

increase statistical power, generalizability, and reliability, while decreasing the cost of 
analysis, because it is performed using a larger number of samples and the effects of 
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individual study- specific biases are reduced. There are two common approaches for the 
problem of integrating microarray data: meta-analysis and direct integration of 
expression values. 

The direct integration procedure [11-13, 17] can be divided into the following 
steps. First, a list of genes common to multiple distinct microarray platforms is 
extracted based on cross-referencing the annotation of each probe set represented on the 
microarrays. Cross-referencing of expression data is usually achieved using UniGene or 
LocusLink/EntrezGene databases or the best matching mapping provided by 
Affymetrix. Next, for each individual dataset, numerically comparable quantities are 
derived from the expression values of genes in the common list by application of 
specific data transformation and normalization methods. Finally, the newly derived 
quantities from individual datasets are combined. Direct integration methods include the 
following: Ramaswamy et al. [14] re-scaled expression values of a common set of genes 
for each of five cancer microarray datasets generated by independent laboratories using 
different microarray platforms. Combining them to form a dataset with increased 
sample size allowed identification of a gene expression signature that distinguished 
primary from metastatic cancers. Bloom et al. [15] used a scaling approach based on 
measurements from one common control sample to integrate microarray data from 
different platforms. Shen et al. [16] proposed a Bayesian mixture model to transform 
each raw expression value into a probability of differential expression (POD) for each 
gene in each independent array. Integrating multiple studies on the common probability 
scale of POD, they developed a 90-gene meta-signature that predicted relapse-free 
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survival in breast cancer patients with improved statistical power and reliability. In 
addition to common data transformation and normalization procedures, Jiang et al. [17] 
proposed a distribution transformation method to transform multiple datasets into a 
similar distribution before data integration. 

Data processed by distribution transformation showed a greatly improved 
consistency in gene expression patterns between multiple datasets. Warnat et al. [18] 
used two data integration methods, median rank scores and quartile discretization, to 
derive numerically comparable measures of gene expression from independent datasets. 
These transformed data were then integrated and used to build support vector machine 
classifiers for cancer classification. Their results showed that cancer classification based 
on microarray data could be greatly improved by integrating multiple datasets with a 
similar focus. The classifiers built from integrated data showed high predictive power 
and improved generalization performance. 

A major limitation of these direct integration methods is that filtering genes to 
generate a subset common to multiple distinct microarray platforms often excludes 
many thousands of genes, some of which may be significant. However, data 
transformation and normalization methods are resource-sensitive [20]; one method may 
be a best fit for some datasets, but not for others. It is difficult to come to a consensus 
regarding a method that is best for data transformation and normalization on given 
datasets. 

Several studies have shown that expression measurements from cDNA and 
oligonucleotide arrays may show poor correlation and may not be directly comparable 
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[21]. These differences may be due to variances in probe content, deposition 
technologies, labeling and hybridizing protocols, as well as data extraction procedures 
(e.g., background correction, normalization, and calculation of expression values). For 
example, cDNA microarray data is usually defined as ratios between experimental and 
control values and cannot be directly compared with oligonucleotide microarray data 
that are defined as expression values of experimental samples. 

Across-laboratory comparisons of microarray data has also demonstrated that 
sometimes there are larger differences between data obtained in different laboratories 
using the same microarray technology than data obtained in the same laboratory using 
different microarray technologies [22]. Wang et al. [116] also showed that the 
agreement between two technologies within the same lab was greater than that between 
two labs using the same technology; the lab effect, especially when confounded with the 
RNA sample effect, usually plays a bigger role than the platform effect on data 
agreement. Commercial microarrays, such as Affymetrix arrays, have produced several 
generations of arrays to keep up with advances in genomic sequence analysis. The 
number of known genes and the representative composition of gene sequences are 
frequently updated and probe sets are modified or added, to better detect target 
sequences and to represent newly discovered genes. A recent study has shown that 
expression measurements within one generation of Affymetrix arrays are highly 
reproducible, but that reproducibility across generations depends on the degree of 
similarity of the probe sets and the levels of expression measurements [17]. Therefore, 
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even when using the same microarray technology, different generations of microarrays 
make direct integration difficult. 

Technical variabilities, which result from differences in sample composition and 
preparation, experimental protocols and parameters, RNA quality, and array quality, 
pose further challenges to the direct integration of microarray data from independent 
studies. Vert et al. [20], and Irizarry et al. [22] described the lab-affect for microarray 
data and concluded that direct integration of expression data is not appropriate. Cahan et 
al. [21] and Warnat et al. [24], using two methods to identify differentially expressed 
genes prior to carrying out a classification analysis, showed that gene expression levels 
themselves could not be directly compared between different platforms. Therefore, we 
propose to utilize the meta-analysis, instead of direct integration, before further 
computer analyses. 

The meta-analysis method, in contrast to the direct integration method, 
combines results from individual analyses. It therefore avoids the problem of scaling the 
gene expression levels among datasets from different laboratories. A number of studies 
have shown that meta-analysis provides a robust list of differentially expressed genes 
[21- 23]. With the increasing use of next generation sequencing techniques, microarrays 
are no longer the only high-throughput technology for gene expression studies. The 
direct integration approach appears to be inappropriate to those repositories, because we 
can only scale the expression values across multiple experiments using the same 
technology. Even with the same technology, e.g., Affymetrix oligonucleotide and 
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cDNA microarrays, the use of this approach seems impossible with scaling expression 
values. 

2.2.2 Issues with microarray data meta-analysis 

Considering that meta-analysis involves the whole process from choosing 

microarray data to detecting differentially expressed genes, the key issues are as 
follows. 

Issue 1: Identify suitable microarray datasets 

The first step is to determine which datasets to use regarding the goal of 
analysis. This is done by determining the inclusion-exclusion criteria. The most 
important criterion is that the datasets collected should be in the standard gene 
expression data format, e.g. features by samples [23]. 

Issue 2: Annotate the individual datasets 

Microarray probe design uses short, highly specific regions in genes of interest 
because using the full-length gene sequences can lead to non-specific binding or noise. 
Different design criteria lead to the creation of different probes for the same gene. 
Therefore, one needs to identify which probes represent a given gene within and across 
the datasets. The first option is to cluster probes based on sequence data [11, 17]. A 
sequence match method is especially appropriate for cross-platform data integration, as 
well as Affymetrix cross-generation data integration. However, the probe sequence may 
not be available for all platforms and the clustering of probe sequences could be 
computationally intensive for very large numbers of probes. 
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Alternatively, one can map probe-level identifiers such as Image ClonelD, 
Affymetrix ID, or GenBank accession numbers to a gene-level identifier such as 
UniGene, RefSeq, or LocusLink/EntrezGene. UniGene, which is an experimental 
system for automatically partitioning sequences into non-redundant gene-oriented 
clusters, is a popular choice to unify the different datasets. For example, UniGene Build 
#211 (released March 12, 2008) reduces the nearly 7 million human cDNA sequences to 
124,181 clusters. To translate probe-level identifiers to gene-level identifiers, one can 
use either the annotation packages in BioConductor, or the identifier mapping tables 
provided by NCBI and AffymetrixID for LocusLink/Entrez ID to probe ID, or UniGene 
for probe ID to RefSeq. 

Issue 3: Resolve the many-to-many relationships between probes and genes 

The relationship between probes and genes is unfortunately not unequivocal, 
which means that in some cases a probe may report more than one gene, and vice versa. 
Even using the same Affymetrix platform, the combination of different chip versions 
creates serious difficulties, because the probe identification labels (IDs) are not 
conserved from chip to chip. Therefore, to combine microarray data across studies, a 
unique nomenclature must be adopted and all the different IDs of the chips must be 
translated to a common system. 

It is reasonable that many probe identifiers are mapped onto one gene identifier. 
This is due to the current UniGene clustering and genome annotation, because multiple 
probes per gene provide internal replicates, and allow for poor performance of some 
probes without the loss of gene expression detection, and because microarray chips 
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contain duplicate spotted probes. The issue is when a probe identifier can be mapped to 
many gene identifiers. This may lead to a problem with the further meta-analysis. For 
example, a probe could map to gene X in half of the datasets, but to both genes X and Y 
in the remaining datasets. The further meta-analysis will treat such probes as two 
separate gene entities, failing to fully combine the information for GenelD X from all 
studies. If one simply throws away such probes, valuable information may be lost to 
further analysis. 

Issue 4: Choosing the meta-analysis technique 

The decision regarding which meta-analysis technique to use depends on the 
specific application. In this context, we focus on a fundamental application of 
microarrays: the two-class comparison, e.g. the class of treatment samples and the class 
of control samples, where the objective is to identify genes differentially expressed 
between two specific conditions. 

Let X Nx p represent the expression matrix of selected datasets from multiple 
studies; where N is the number of common genes in the P selected samples. Let K be 
the number of studies from which the samples were selected. Denote by Pk the number 
of samples that belong to the k th study. Let P k c be the number of control samples and 
P k be the number of treatment or experimental samples from the k th study, then 

K 

P k = P k c + P k and P = ^ P k . For Affymetrix oligonucleotide array experiments, we 

k=l 

have P k c chips with gene expression measures from the control class and P k chips with 
gene expression measures from the treatment class. For the two-channel array 
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experiments, we assume that the comparisons of log-ratios are all indirect, i.e., the 
samples from the treatment class are hybridized against a reference sample RA. Then 
the expression values from the k th study are collected into X: 
Xl=log 2 (Ji/R),} = l,...,VZ and X^ q =log 2 (C q /R),q = l,...,P k c . The meta-analysis 

will be performed on the results from analyses that were performed on X. There are four 
common ways to integrate such information across studies. 

Vote counting 

This method simply counts the number of studies where a gene was declared as 
significant [48]. If the number of studies is small, this method can be visualized 
using Venn diagrams. 

Combining ranks 

Unlike vote counting, this technique accounts for the order of genes declared 
significant [30, 52-54]. There are three different approaches to aggregate the 
rankings of, say, the top 100 lists (the 100 most significantly up-regulated or 
down-regulated genes) from different studies [23]. Two of the algorithms use 
Markov chains to convert the pair-wise preference between the gene lists to a 
stationary distribution; the third algorithm is based on an order-statistics model. 

The rank product method, proposed by Breitling et al. [29] is the most popular 
of this approach using the third algorithm. It is a non-parametric statistic which 
is originally used to detect differentially expressed genes in a single dataset. It is 
derived from the biological reasoning of the fold change (FC) criterion and it 
detects the genes that are consistently found among the most strongly up- 



26 



regulated (or down-regulated) genes in a number of replicate experiments. 
However, this method offers a natural way to overcome the heterogeneity 
among multiple datasets and, therefore, can be extended to meta-analysis, which 
generates a single significance measurement for each gene in the combined 
study [30]. Within a given dataset of the study k, the pairwise fold change is 
computed for every gene i,pFC kq =T k /C q , k = l,...,Pj[,q = l,...,P£. There are 

Pk x Pjc suc h ra ti° values for each gene i, let R=Pk x Pk • These ratios are ranked 
within comparison to generate a ranking matrix rFC N: R. The rank product (RP) 
is computed as the geometric mean of rank for each gene i, by 

r r V /R 

RPi= ri rFc J • (zi) 

V r=l J 

RPi/n can be interpreted as a p-value, because it describes the probability of 
observing gene i at rank rFQ or better in the q th comparison, q=l,...,R. A 
permutation procedure is independently carried out to generate all other possible 
combinations of ranks relative to gene i in the dataset, nRPi. Repeat this 
procedure B times, where B is a positive number, to form reference distribution 
with nRPi across R comparisons which is then used to compute the (adjusted) p- 
value and false discovery rate (FDR) for each gene. 

Combining p-values 

In the 1920s, Fisher developed a meta-method that combined the p-values from 
individual studies [117]. In the k th study, for the i th gene, the p-value pik is 
generated by one-sided hypothesis testing. The logs of the p-values of the same 
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gene are summed across K studies using the Fisher's sum of logs method which 
is defined as follows [50, 51], 

S i= -2f>g 2 (p ik ). (2.2) 

k=l 

The Si values are then assessed by comparison against a chi-square 
distribution with 2K degrees of freedom, % 2 2K • 

Rhodes et al. [36] proposed a statistical model for performing meta- 
analysis of four independent prostate cancer microarray datasets from cDNA 
arrays and Affymetrix arrays. Each gene in each study was treated as an 
independent hypothesis and a significance (denoted by one p value and one q 
value) was assigned to each gene in each study based on random permutations. 
The similarity of significance across studies was assessed with meta-analysis 
methods and combined with multiple inference statistical tests for each possible 
combination of studies. A cohort of genes was identified to be consistently and 
significantly deregulated in prostate cancer. Marot et al. [40] used a sequential 
meta-analysis method in which they computed p-values in each sequential step 
and combined p-values from different sequential steps to determine the p-values 
for the results which are either significant genes or biological processes 
throughout the entire analysis process. 

Combining effect-sizes 

The first step is to calculate the effect size dik and the variance 8ik associated 
with the effect size for gene i in the k th study, i = 1,. . .,N; k = 1,. . .,K. 



28 



d ik =^^, (2.3) 



'k 



where S£ is the pooled standard deviation estimated across N genes in the 



th 



dataset of the k study. 



$1 = JX( N . k "1)< /(IX -N) • (2-4) 



i=l / i=l 



Based on the mean differences, an estimated pooled mean difference, |Lii, and its 
variance, af , are computed for each gene across K studies. A z-score zi is then 
calculated using \i{ andaf : 

z i= m/V?- (2.5) 

Effect size can be calculated using the correlation coefficient and the method of 
Cohen [135], which is the difference between the means of two groups 
standardized by its pooled standard deviation [37, 48, 51]. The statistical 
significance can be determined using a comparison against N(0,1) on these z- 
scores. However, if the number of studies, K, is small, there may be a problem 
with over-fitting. In this case, the permutation method is used instead. The 
method from Hedges and Olkin [136] showed that this standardized difference 
overestimates the effect size for studies with small sample sizes. They proposed 
a small correction factor to calculate the unbiased estimate of the effect size, 
which is known as the Hedges' adjustment. The study-specific effect sizes for 
every gene are then combined across studies into a weighted average. The 
weights are inversely proportional to the variance of the study-specific 
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estimates. Choi et al. [37-38] introduced a new meta-analysis method, which 
combines the results from individual datasets in the form of effect size and has 
the ability to model the inter-study variation. The effect size was defined to be a 
standardized mean difference between the test samples and normal samples. The 
effect sizes from multiple microarray datasets were combined to obtain an 
estimate of the overall mean, and statistical significance was determined by the 
permutation test extended to multiple datasets. It was demonstrated that data 
integration using this method promoted the discovery of small but consistent 
expression changes and increased the sensitivity and reliability of the analysis. 
An extended effect size model was then proposed by Hu et al. [39] for meta- 
analysis of microarray data. 

Issue 5: Samples /features selection 

Regardless of the methods of data integration, finding features or samples that 
together represent the biological meaning for the integrated datasets remains a 
challenge. Lin [42] used a heuristics method to detect significant genes from integrated 
data. The underlying spaces were determined using a heuristics search. Then, a set of 
candidate genes lists were identified again using a heuristics search. Parkhomenko et al. 
[43] used principle component analysis (PC A) to determine the most important 
components of the data spaces. Pairs of components are processed one by one using a 
correlation measurements method to look for the 'best' combination of components in 
the dataset. This analysis is also a heuristic-based one. Su et al. [44] used a set of genes 
similar to the gene expression signature targeted to the biological context of interest, to 
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detect candidate genes called VIP (Very Important Pool) genes. These genes control the 
search process to find the significant samples (in the integrated dataset, the samples are 
columns, as we mentioned in Issue 1) and genes from the integrated data. Tan et al. [47] 
approached the problem using GA and a resampling method; the feature and sample 
candidates were selected under a controlled process using GA and the fitness function 
was computed using statistical test results on resample sets. 

Traditional methods to identify similar samples on the basis of gene expression 
values are Principle Component Analysis and Hierarchical Cluster Analysis. These 
methods, however, have the problem of strongly relating samples of the same cell lines 
and batches because of the similarity among cells grown at the same time [25]. In 
addition, these methods require all expression profiles to be generated on the same 
microarray platform, because the determination of distances between expression 
profiles from different platforms is not straightforward, which limits use to the available 
microarray repositories. Lastly, these methods may not benefit from additional 
information regarding the biological context of interest. For such cases, Huttenhower et 
al. [49] proposed a meta-analysis method based on a statistical approach within a 
Bayesian framework. Considering the biological context of the analysis, they measured 
the distance between every pair of genes and constructed a Bayesian framework for 
each specific biological context based on the set of arrays. The Bayesian framework 
was then used to compute the probability of the relevance of each array regarding the 
context. This probabilistic measurement can be used to filter the samples. Because their 
method requires a set of genes targeting the biological context, in addition to a 
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definition by the biologists, the biological context can be defined by a set of genes 
having specific GO (Gene Ontology) terms, or participating in a specific biological 
process, or belonging to a gene expression signature. 

Lamb et al. [25] and Zhang et al. [26], by using gene expression signatures 
proposed pattern-matching, used ranking-based approaches to perform meta-analysis on 
CMAP datasets. Genes in each expression profiles were ranked using fold changes. 
Expression profiles were then integrated based on how they fit a given gene expression 
signature. Their research successfully identified a number of interesting compounds that 
correlated with known biological processes. 

2.3 Cluster analysis 

After meta-analysis, we identify a list of genes differentially expressed in a single 
study or across multiple studies. Using this gene list, we filter out the matrix of 
expression levels of genes significantly expressed under the experimental conditions. 
Cluster analysis then groups genes where genes within the same cluster have similar 
expression patterns across studies. This analysis allows the simultaneous comparison of 
all clusters independently of the platform used and the species studied; similar samples 
under specific biological conditions from different species can also be analyzed together 
by meta-analysis. This comparison allows identification of: i) robust signatures of a 
pathology or a treatment across several independent studies, ii) sets of genes that may 
be similarly modulated in different disease states or following drug treatments, iii) 
common sets of co-expressed genes between human and animal models. Because a 
typical microarray study generates expression data for thousands of genes from a 
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relatively small number of samples, an analysis of the integrated data from several 
sources or laboratories may provide outcomes that may not necessarily be related to the 
biological processes of interest. Cluster analysis will further explore the groups of 
genes that are absolute key agents of the experimental results. 

Given a set of n objects X = fxj, X2,...,x n J, let 9 - {U, V}, where cluster set V = 
{vi, V2,...,v c J and partition matrix U={ukJ, k=l,...,c, i=l,...,n, be a partition of X such 

that X = u£ =1 v k and ^° u ki = l,i = l,...,n . Each subset Vk of X is called a cluster and 

{uki} is the membership degree of {xi} to Vk. Uki e {0,1} if is crisp partition, otherwise, 
u^ e[0,l]. The goal of cluster analysis is to assign objects to clusters such that objects 
in the same cluster are highly similar to each other while objects from different clusters 
are as divergent as possible. These sub-goals create what we call the compactness and 
separation factors that are used, not only for modelling the clustering objectives, but 
also for evaluating the clustering result. These two parameters can be mathematically 
formulated in many different ways that lead to numerous clustering models. 

A dataset containing the objects to be clustered is usually represented in one of 
two formats, the object data matrix and the object distance matrix. In an object data 
matrix, the rows usually represent the objects and the columns represent the attributes of 
the objects regarding the context where the objects occur. The roles of the rows and 
columns can be interchanged for another representation method, but this one is 
preferred because the number of objects is always enormously large in comparison with 
the number of attributes. Here objects are genes and attributes are the expression levels 
of the genes under the experimental conditions. Assume we have n genes and p 
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experimental conditions. The gene data matrix X then has n rows and p columns where 
xy is the expression level of gene i under the condition j and xi is the expression vector 
(or object vector) of gene i across p experiments. The distance matrix contains the 
pairwise distance (or dissimilarity) of objects. Specifically, the entry (i,j) in the distance 
matrix represents the distance between objects i and j, l<i,j<n. The distance of genes i 
and j can be computed using the object vectors i and j from the object data matrix based 
on a distance measurement. However, the object data matrix cannot be fully recovered 
from the distance matrix, especially when the value of p is unknown. 

2.3.1 Clustering algorithms 

Clustering algorithms are classified into either hierarchical or partitioning 

approaches. While the hierarchical methods group objects into a dendrogram (tree 
structure) based on the distance among clusters, the partitioning methods directly 
groups objects into clusters and objects are assigned to clusters based on the criteria of 
the clustering models. 

Clustering methods using a partitioning approach are classified into either 
model-based or heuristic-based approaches. Model-based clustering assumes that the 
data were generated by a model and tries to recover the original model from the data. 
The model obtained from the data then defines the clusters and assigns objects to them. 
Although K-means is a heuristic-based approach, once a set of K centers that are good 
representatives for the data is found, we can consider it as a model that generates the 
data with membership degrees in {0,1} and an addition of noise. In this context, K- 
Means can be viewed as a model-based approach. 
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2.3.2 Hierarchical clustering algorithms 

Hierarchical clustering (HC) builds a dendrogram (hierarchical structure) of 

objects using either agglomerative (bottom-up) or divisive (top-down) approaches. In 
the former approach, the dendrogram is initially empty; each object is in its own sub- 
tree (singleton cluster). The clustering process merges similar sub-trees. At each step, 
the two sub-trees most similar are merged to form a new sub-tree. The process stops 
when the desired number of sub-trees, say c, at the highest level is achieved. In the top 
down approach, the dendrogram starts with a sub-tree that contains all the objects. The 
process then splits the sub-trees into new sub-trees. At each step, the sub-tree of objects 
with maximum dissimilarity is split into two sub-trees. The process stops when the 
number of leaves, say c, or the compactness and separation criteria, is achieved. The 
dissimilarity between every pair of clusters is measured by either the single linkage: 

L s( v k> v i) = min{d(x 1 ,x J );x 1 e v k ,Xj e vj, (2.6) 

the complete linkage: 

L c( v k ? v 1 ) = max{d(x i ,x j );x i ev k ,Xj evj, (2.7) 

or the average linkage: 

Ed( Xl , Xj ) 

L A (v k ,v 1 )= x - eVt r; i l , . (2.8) 



v k v n 



The clustering process results in a set of c sub-trees at the level of interest in 
each approach. In comparison with the partitioning approach, each such sub-tree 
corresponds to a cluster. 
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Because of their methods of grouping or splitting sub-trees, HC algorithms are 
considered heuristic-based approaches. The common criticism for standard HC 
algorithms is that they lack robustness and, hence, are sensitive to noise and outliers. 
Once an object is assigned to a cluster, it will not be considered again, meaning that HC 
algorithms are not capable of correcting prior misclassifications and each object can 
belong to just one cluster. The computational complexity for most HC algorithms is at 
least 0(n 2 ) and this high cost limits their application in large-scale datasets. Other 
disadvantages of HC include the tendency to form spherical shapes and reversal 
phenomenon, in which the normal hierarchical structure is distorted. 

2.3.3 K-Means clustering algorithm 

The K-Means algorithm is a partition approach using a heuristic-based method. 

Its objective function is based on the square error criterion, 

J(U, V | X) = JiXJ^ - v k f , where u kl e {0,1}. (2.9) 

k=l i=l 

K-Means is the best-known square error-based clustering algorithm. Its 
objective, J, is based on both compactness and separation factors. Minimizing J is 
equivalent to minimizing the compactness while maximizing the separation. 
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K-Means algorithm 
Steps 

1) Initialize the partition matrix randomly or based on some prior knowledge. 

2) Compute the cluster set V. 

3) Assign each object in the dataset to the nearest cluster. 

4) Re-estimate the partition matrix using the current cluster set V. 

5) Repeat Steps 1, 2 and 3 until there is no object changing its cluster. 

The K-Means algorithm is very simple and can easily be implemented to solve 
many practical problems. It works very well for compact and hyper-spherical clusters. 
The computational complexity of K-Means is O(cn). Because c is much less than n, K- 
Means can be used to cluster large datasets. The drawbacks of K-Means are the lack of 
an efficient and universal method for identifying the initial partitions and the number of 
clusters c. A general strategy for the problem is to run the algorithm many times with 
randomly initialized partitions, although this does not guarantee convergence to a global 
optimum. K-Means is sensitive to outliers and noise. Even if an object is distant from 
the cluster center, it can still be forced into a cluster, thus, distorting the cluster shapes. 

2.3.4 The mixture model with expectation-maximization algorithm 

Under the probability context, objects can be assumed to be generated according 
to several probability distributions. Objects in different clusters may be generated by 
different probability distributions. They can be derived from different types of density 
functions (e.g., multivariate Gaussian or t-distribution), or the same families, but with 
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different parameters. If the distributions are known, finding the clusters of a given 
dataset is equivalent to estimating the parameters of several underlying models. Denote 

by P(vk) the prior probability of cluster Vk and ^!LP( v k) = l> the conditional 

probability of object Xi given clustering partition = {U,V} is, 

P( Xl |e) = XP( Xi |u k ,v k )xP(v k ). (2.10) 

k=l 

Given an instance of 6, the posterior probability for assigning a data point to a 
cluster can easily be calculated using Bayes' theorem. The mixtures therefore can be 
applied using any type of components. The multivariate Gaussian distribution is 
commonly used due to its complete theory and analytical tractability. 

The likelihood function P(X|6), P(X| 0) = ]Xi P ( x i I ) is the P robabilit y of 
generating X using the model. The best model , therefore, should maximize the log 
likelihood function, L(X | 0) = log(P(X | 0)) = ^ n i log(P(x 1 | 0)) , 

L(X | 0*) = max{L(X | 1 )}. (2.1 1) 

can be estimated using the expectation-maximization (EM) algorithm. EM 
considers X a combination of two parts: X° is the observations of X; X M is the missing 
information of X regarding 0, and X M is similar to U of crisp clustering. The complete 
data log likelihood is then defined as, 

L(x|e) = L(x°,x M |e) = XX x riog[p(v k )x P ( x °|e k ]. (2.12) 

i=l k=l 
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By using an initial value of 0, 0°, EM searches for the optimal value of by 
generating a series of estimates {0°, 1 ,..., T }, where T represents the reaching of the 
convergence criterion. 

EM algorithm 
Steps 

1) Randomly generate a value of 0, 0°, set t=0. 

2) E-step: Compute the expectation of the complete data log likelihood, 
QCM^ECLCXIG)). 

3) M-step: Select a new parameter estimate that maximizes Q(.), 

t+1 =max e { (3(0,0') }• 

4) Set t=t+l. Repeat Steps 2 and 3 until the convergence criterion holds. 

EM assumes that the data distribution follows specific multivariate distribution 
model, for example, a Gaussian distribution. The major disadvantages of the EM 
algorithm are the sensitivity to the selection of the initial value of 0, the effect of a 
singular covariance matrix, the possibility of converging to a local optimum, and the 
slow convergence rate. In addition, because most biological datasets do not follow a 
specific distribution model, EM may not be appropriate. 

2.3.5 Fuzzy clustering algorithm 

The clustering techniques we have discussed so far are referred to as hard or 
crisp clustering, which means that each object is assigned to only one cluster. For fuzzy 
clustering, this restriction is relaxed, and the object xi can belong to cluster Vk, k=l,. . .,c, 
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with certain degrees of membership, Uki, Uki e[0,l]. This is particularly useful when the 
boundaries among the clusters are not well separated, and are therefore ambiguous. 
Moreover, the memberships may help us discover more sophisticated relationships 
between a given object and the disclosed clusters. Fuzzy C-Means (FCM) is one of the 
most popular fuzzy clustering algorithms. FCM attempts to find a fuzzy partition for a 
set of data points while minimizing the objective function as (2-7). By using afuzzifier 
factor, m, FCM is flexible in managing the overlap regions among clusters, hence 



n c 
X y X y „ m II 



j(u,v|x) = 2,2 J u k n ;|x l -v k || 



i=l c=l 



FCM algorithm 



Steps 



1) Randomly initialize the values of partition matrix U°, set t=0. 

2) Estimate the value of v 1 using 



ZKr*. 



k n 



(2.14) 



i=l 

3) Compute the new values of U t+1 that maximally fit with v 1 . 



X; - V, 



m-1 




M X i _V i 

VII l J ll J 



m-1 



4) Repeat Steps 2 and 3 until either {vjj or {u.i} is convergent. 



(2.13) 



(2.15) 



40 



As with the EM and K-Means algorithms, FCM performs poorly in the presence 
of noise and outliers, and has difficulty with identifying the initial parameters. To 
address this issue, FCM is integrated with heuristic-based search algorithms, such as 
GA, Ant colony (AC), and Particle swarm optimization (PSO). The mountain method or 
Subtractive clustering can also be used as an alternative to search for the best initial 
value of V of the FCM algorithm. Due to Possibility Theory limitations, it is difficult to 
recover data from a fuzzy partition model. However, recent research has shown that it is 
possible to construct a probability model using the possibility one. Therefore, by using a 
possibility to probability transformation, it is possible to recover the data from the fuzzy 
partition model. The FCM algorithm is considered to be model-based algorithm. 

The FCM algorithm has advantages over its crisp/probabilistic counterparts, 
especially when there is a significant overlap between clusters [57-59]. FCM can 
converge rapidly and provides more information about the relationships between genes 
and groups of genes in the cluster results. We therefore choose the FCM algorithm for 
the cluster analysis of gene expression data. However, the FCM algorithm also has 
some drawbacks. The objective function can help to estimate the model parameters, but 
cannot distinguish the "best" solution from the numerous possibly local optimum 
solutions. The lack of an effective cluster validation method for FCM prevents its 
application to real world datasets, where the number of clusters is not known. FCM is 
well known for its rapid convergence, but this is not guaranteed to reach the global 
optimum. Because of these issues, further analysis with FCM will be carried out in 
Chapter 3 that will help us develop a novel method. 
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2.4 Gene regulatory sequence prediction 

The cluster analysis process produces groups of genes with similar expression 
patterns. Genes in each group may also have similar regulatory sequences. 
Understanding global transcriptional regulatory mechanisms is one of the fundamental 
goals of the post-genomic era [69]. Conventional computational methods using 
microarray data to investigate transcriptional regulation focus mainly on identification 
of transcription factor binding sites. However, many molecular processes contribute to 
changes in gene expression, including transcription rate, alternative splicing, nonsense- 
mediated decay and mRNA degradation (controlled, for example, by miRNAs, or RNA 
binding proteins). Thus, computational approaches are needed to integrate such 
molecular processes. 

miRNAs bind to complementary sites within the 3 f -UTRs of target mRNAs to 
induce cleavage and repression of translation [175]. In the past decade, several hundred 
miRNAs have been identified in mammalian cells. Accumulating evidence thus far 
indicates that miRNAs play critical roles in multiple biological processes, including cell 
cycle control, cell growth and differentiation, apoptosis, and embryo development. At 
the biochemical level, miRNAs regulate mRNA degradation in a combinatorial manner, 
i.e., individual miRNAs regulate degradation of multiple genes, and the regulation of a 
single gene may be conducted by multiple miRNAs. This combinatorial regulation is 
thought to be similar in scope to transcription factor partner regulation. Therefore, in 
addition to transcription factors and other DNA/RNA-binding proteins, comprehensive 
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investigations into transcriptional mechanisms underlying alterations in global gene 
expression patterns should also consider the emerging role of the miRNAs. 

Motif finding algorithms are designed to identify transcription factor binding sites 
(TFBS) and miRNA complementary binding sites (MBS). These sequences are usually 
short, ~ 8bp and 23bp, respectively. In addition, each gene may have zero, one, or more 
such binding sites (BS). These BSs or motifs may be over-represented differently in 
DNA and RNA sequences of genes in the same group. Motifs are difficult to recognize 
because they are short, often highly degenerate, and may contain gaps. Although motif- 
finding is not a new problem, challenges remain because no single algorithm both 
describes motifs and finds them effectively. Popular methods for motif-finding use a 
position specific score matrix (PSSM) that describes the motifs statistically, or 
consensus strings that represent a motif by one or more patterns that appear repeatedly 
with a limited number of differences. Of the two, PSSM is preferred because it is more 
informative and can easily be evaluated using statistical methods. In addition, a 
consensus motif model can be replaced by a PSSM one. 

One of the most popular motif- finding algorithms using PSSM is MEME proposed 
by Bailey and Elkan [70]. The advantage of MEME is that it uses expectation- 
maximization (EM) which, as a probability-based algorithm, produces statistically 
significant results if it can reach a global optimum. The disadvantage is that for motif 
seeds, it uses existing subsequences from within the sequence set and, as a result, may 
fail to discover subtle motifs. Chang et al. [72] and Li et al. [76] overcame this 
drawback by using the genetics algorithm (GA) to generate a set of motif seeds 
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randomly. However, because they used GA with random evolution processes, a rapid 
convergence to a solution is not assured. Li et al. [77] improved on MEME by using one 
instance of a position weight matrix (PWM), a type of PSSM, to represent a motif, and 
statistical tests to evaluate the final model. However, because EM may converge to 
local optima, use of a single PWM may fail to find a globally optimal solution. The GA- 
based methods of Wei and Jensen [83] and Bi [71] use chromosomes to encode motif 
positions. The method in the former [83] is appropriate for models with zero to one 
occurrence of the motif per sequence (ZOOPS); the latter [71] is appropriate for models 
with one occurrence per sequence (OOPS), because one variable can be used to 
represent the single motif occurrence in each sequence. However, Li et al. [77] recently 
showed that ZOOPS and OOPS are inadequate when not every sequence has the same 
motif frequency, and that the two-component mixture (TCM) model, which assumes a 
sequence may have zero or multiple motif occurrences, should be used. However, TCM 
requires a set of variables for every sequence to manage motif positions, and hence, the 
size of a chromosome can approach the size of the dataset [145]. Lastly, the above 
algorithms are restricted to finding gapless motifs and, therefore, will fail to find many 
functionally important, gapped motifs. While some methods, e.g., pattern-based 
methods of Pisanti et al. [81] and Frith [73], allow gapped motifs, they require the 
gapped patterns to be well-defined and they generate gap positions randomly or by 
using a heuristic method. Alternatively, Liu et al. [78] used neural networks to find 
gapped motifs, but their approach required a limited and specific definition of the neural 
network structure. 
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2.5 Datasets 

In order to evaluate the previous studies as well as our proposed methods for 
microarray analysis, we will use multiple data types with different levels of complexity. 
Artificial datasets are used in both clustering and motif-finding and are preferred for 
testing methods because of the ease in generating and running benchmarks. For cluster 
analysis, we generated datasets using a finite mixture model of data distribution [27], 
because known cluster structures are ideal for evaluating the capability of an algorithm 
for determining the number of clusters and the cluster prototypes. For sequence 
analysis, we generated motifs in the set of randomly generated DNA sequences using 
the normal distribution model [71] with a known number of motif occurrences and 
locations in every sequence. Some popular datasets in the Machine Learning 
Repository, University of California Ivrine (UCI) [84], are also used to show how 
effective our methods are for cluster analysis using real datasets. The most complex 
datasets, e.g., gene expression data, including CMAP datasets, are the most challenging 
of clustering problems. Lastly, for sequence analysis, we used eight DNA transcription 
factor binding site datasets; two eukaryotic datasets: ERE and E2F [71], and six 
bacterial datasets: CRP, ArcA, ArgR, PurR, TyrR and IHF [80]. 

2.5.1 Artificial datasets for cluster analysis 

Artificial datasets were generated using a finite mixture model of data distribution 

[27] with different numbers of data objects, clusters, and dimensions. The clusters were 
generated with a slight overlap (overlapping ratio = 0.1). For test purposes, we selected 
five datasets, named from ASET1 to ASET5, where the first four datasets are uniform 
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and the last one is non-uniform. Datasets of different dimensions support the stability of 
our methods. We expect our methods to properly evaluate clustering results, 
successfully detect the correct number of clusters, properly locate the cluster centers, 
and to correctly assign data points into their actual clusters. 

2.5.2 Artificial datasets for sequence analysis 

We generated simulated DNA datasets as in Bi [71], with three different 

background base compositions: (a) uniform, where A, T, C, G occur with equal 
frequency, (b) AT-rich (AT = 60%), and (c) CG-rich (CG = 60%). The motif string, 
GTCACGCCGATATTG, was merged once or twice into each sequence, after a defined 
level of the string change: (i) 9% change representing limited divergence (i.e. 91% of 
symbols are identical to the original string), (ii) 21% change, or (iii) 30% change which 
is essentially background or random sequence variation. 

2.5.3 Real clustering datasets 

Some popular real datasets from the UCI Machine Learning Repository [84] were 

used to test our method. 

Iris data 

The Iris dataset contains information about the sepal length, the sepal width, the 
petal length, and the petal width of three classes of Iris flowers; setosa, versicolor and 
virginica. Each class has 50 data objects from a total of 150 data objects in the dataset. 
This dataset does not contain missing values. The expected results should have the three 
classes with their own data objects. 
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Wine 

This dataset contains 178 data objects, each with 13 attributes corresponding to the 
measures of alcohol, malic acid, ash, alcalinity of ash, magnesium, total phenols, 
flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 
of diluted wines and proline. The data objects are known to belong to three different 
classes. There are no missing values in the dataset. We expect to detect three clusters in 
the dataset and properly assign the data objects to their classes. 

Glass 

This dataset contains 214 data objects, each having ten attributes of which nine are 
the glass characteristics of refractive index, sodium, magnesium, aluminum, silicon, 
potassium, calcium, barium, and iron content. The last attribute, a value ranging from 1 
to 6, is for glass identification. The dataset with the first nine attributes is used for 
clustering and the results should contain six clusters and assign the data objects to their 
proper class identification. 

Breast Cancer Wisconsin 

The Breast Cancer Wisconsin (BCW) dataset contains 699 data objects with nine 
attributes regarding the clump thickness, uniformity of cell size, uniformity of cell 
shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, 
normal nucleoli, and mitoses. The objects of the datasets are from two different classes 
of tumor state: benign and malignant. The number of classes in this dataset is unknown. 
Recent research has concluded that 6 is a reasonable number of classes. We expect to 
discover 6 clusters in the dataset using our methods. 
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2.5.4 Gene expression datasets 

Yeast and Yeast-MIPS data 

The yeast cell cycle data showed expression levels of approximately 6000 genes 
across two cell cycles comprising 17 time points [85]. By visual inspection of the raw 
data, Cho et al. [85] identified 420 genes that show significant variation. From the 
subset of 420 genes, Yeung et al. [102] selected 384 genes that achieved peak values in 
only one phase, and obtained five standard categories by grouping into one category 
genes that peaked during the same phase. Among the 384 selected genes, Tavazoie et al. 
[118], through a search of the protein sequence database, MIPS [86], found 237 genes 
that can be grouped into 4 functional categories: DNA synthesis and replication, 
organization of centrosome, nitrogen and sulphur metabolism, and ribosomal proteins. 
The functional annotations show the cluster structure in the dataset. We named these 
two subsets Yeast and Yeast-MIPS respectively. We expect clustering results to 
approximate the five class partition in the former and a four class partition in the latter. 

RCNS - Rat Central Nervous System data 

The RCNS dataset was obtained by reverse transcription-coupled PCR designed to 
study the expression levels of 1 12 genes over nine time points during rat central nervous 
system development [87]. Wen et al. [87] classified these genes into four functional 
categories based on prior biological knowledge. These four classes are external criterion 
in this dataset. 
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2.5.5 CMAP datasets 

CMAP datasets contain expression profiles from multiple drug treatments 

performed at different concentrations. Originally there were 164 distinct small- 
molecules used in 453 different treatment instances resulting in 564 expression profiles. 
These datasets have been used to detect associations between genes and drugs. Lamb et 
al. [25] and Zhang et al. [26] were successful in recovering and exploiting some gene- 
drug associations using gene expression signatures with matching-and-ranking based 
methods. These datasets will be used in our approach to demonstrate that we can 
replicate the results of Lamb et al. [25] and Zhang et al. [26]. In addition, more gene 
expression signatures will be discovered and the associations between DNA motifs and 
drugs will be established using our methods. By discovering novel gene expression 
signatures and related pathways for the drugs, we expect to exploit the desired effects as 
well as the side-effects of drug treatments. Additionally, by using our motif-finding 
methods, we hope to predict some of the associations between drugs and DNA 
sequences and the connections between miRNAs and diseases. 

2.5.6 Real biological sequence datasets 

A. Eukaryotic transcription factor binding sites datasets 

E2F dataset 

E2F transcription factors play a key role in the regulation of cell cycle-associated 
genes. Identifying target genes for these factors is a major challenge. We used the set of 
25 sequences from Kel et al. [35] that contain 27 instances of the E2F transcription 
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factor binding site. The goal is to discover motif occurrences that locate exactly or 
overlap with binding sites at known positions. 

ERE dataset 

The estrogen receptor (ER) is a ligand-activated enhancer protein that is a member 
of the steroid/nuclear receptor superfamily. Two genes encode mammalian ERs, 
ERalpha and ERbeta. In response to estradiol (E2), ER binds with high affinity to 
specific DNA sequences called estrogen response elements (EREs) and transactivates 
gene expression. We used the set of 25 sequences from Klinge [14] each of which 
contains an instance of the ERE. The purpose is to detect the motif sequences located 
exactly at or overlapping with EREs. 

B. Bacterial transcription factor binding sites datasets 

CRP dataset 

CRP is one of seven "global" transcription factors in E. coli known to regulate 
more than 100 transcription units. CRP's activity is triggered in response to glucose 
starvation and other stresses by the binding of the second messenger cAMP. CRP 
binding sites have proved to be particularly noisy because computational searches for 
consensus binding sites have missed many known binding sites. CRP was chosen for its 
highly indiscriminate binding site [115]. 

ArcA dataset 

ArcA is a global regulator that changes in relation to the expression of fermentation 
genes and represses the aerobic pathways when E. coli enters low oxygen growth 
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conditions. Arc A was chosen for its different protein domain (CheY-like) and very low 
consensus binding site [115]. 

ArgR dataset 

ArgR, complexed with L-arginine, represses the transcription of several genes 
involved in biosynthesis and the transport of arginine, histidine, and its own synthesis, 
and activates genes for arginine catabolism. ArgR is also essential for a site-specific 
recombination reaction that resolves plasmid ColEl multimers to monomers and is 
necessary for plasmid stability. 

PurR dataset 

PurR dimers control several genes involved in purine and pyrimidine nucleotide 
biosynthesis and its own synthesis. This regulator requires binding of two products of 
purine metabolism, hypoxanthine and guanine, to induce the conformational change that 
allows PurR to bind to DNA. 

TyrR dataset 

TyrR, tyrosine repressor, is the dual transcriptional regulator of the TyrR regulon 
that involves genes essential for aromatic amino acid biosynthesis and transport. TyrR 
can act both as a repressor and as an activator of transcription at a 70 -dependent 
promoters. 

IHF dataset 

IHF, Integration Host factor, is a global regulatory protein that helps maintain DNA 
architecture. It binds to and bends DNA at specific sites. IHF plays a role in DNA 
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supercoiling and DNA duplex destabilization and affects processes such as DNA 
replication, recombination, and the expression of many genes. These sequence sets were 
chosen because of their highly degenerate binding sites. 

C. Protein sequences 

Sequences from protein families in PFAM and Prosite databases with known motifs 
are used to evaluate the performance of our method. The DUF356, Strep-H-triad and 
Excalibur (extracellular calcium-binding) protein families have unambiguous motifs. 
Those from Nup-retrotrp and Flagellin-C families have weak signals. Sequences from 
these families are also used to evaluate the capability of detecting subtle motifs. In 
addition to these families, sequences from Xin Repeat and Planctomycete Cytochrome 
C protein families are used because they have more variable motifs. For gapped motif 
examples, we used sequences from the zf-C2H2, EGF2, LIGCYCLIN1, LIG_PCNA 
and MODTYRITAM families. To test the ability of de novo motif prediction, we 
used sequences from ZZ, Myb and SWIRM domains. Because motifs of these domains 
are unknown, the results from our method are compared with other motif-finding 
algorithms and we expect that our methods will agree with the results of others as 
shown in Chapter 4. 
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3. Fuzzy cluster analysis using FCM 

3.1 FCM algorithm 

The FCM algorithm was initially developed by Dunn [121] and generalized later by 
Bezdek [92] with an introduction of the fuzzifier, m, is the best well-known method for 
fuzzy cluster analysis. FCM allows gradual memberships of data points to clusters 
measured as degrees in [0,1]. This provides the flexibility in describing the data points 
that can belong to more than one cluster. In addition, these membership degrees provide 
a much finer picture of the data model; they afford not only an effective way of sharing 
a data point among multiple clusters but also express how ambiguously or definitely the 
data point belongs to a specific cluster. 

Let X = {xi, X2,..., x n } be the dataset in a p-dimensional space R p where each xi = 
(xn, Xi2,..., Xi P ) g R p is a feature vector or pattern vector, and each Xy is the j th 
characteristic of the data points xi. Let c be a positive integer, 2 < c < n, R cn denote the 
vector space of all real c x n matrices. Let V = {vi, V2,. • ., v c } e R cp of which Vk e R p is 
the center or prototype of the k th cluster, 1 < k < c. 

The concept of the membership degrees is substantiated by the definition and 
interpretation of fuzzy sets. Fuzzy clustering, therefore, allows got a fine grained 
solution space in the form of fuzzy partitions of the dataset. Each cluster Vk in the 
cluster partitions is represented by a fuzzy set |Lik as follows: complying with fuzzy set 
theory, the cluster assignment Uki is now the membership degree of the data point xi to 
cluster Vi, such that u^ = |Uk(xO e [0,1]. Because memberships in clusters are fuzzy, 
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there is not a single cluster label assigned to each data point. Instead, each data point xi 
will be associated by a fuzzy label vector that states its memberships in c clusters, 

u .i=( u ii> u 2i>---> u ci} T - C 3 - 1 ) 

Definition 3.1 (fuzzy c-partition): Any matrix U = {uki} e M fcn defines a fuzzy c- 
partition (or fuzzy partition), where 

M fcn = |u e R cn | u kl e [0,1] Vi,k;Ju kl = 1 Vi;0 < ]Tu kl < n Vkj . (3.2) 

I k=l i=l J 

Row k 1 of a matrix U e M fcn exhibits the k th membership function Uk in the fuzzy 
partition matrix U. The definition of a fuzzy partition matrix (3.2) restricts assignment 
of data to clusters and what membership degrees are allowed. 

Definition 3.2 (fuzzy c -means Junctionals): Let J m : M cn x R cp — > R + be, 

J m (U,V) = XXuSd 2 (x i ,v k )^min, (3.3) 

i=l k=l 

where m is the fuzzifier factor, l<m<oc, also called the weighting exponent, which 
controls the degree of fuzziness of the clustering; J m (U,V) is the infinite family of 
objective functions, and d(.) is any inner product norm of R p : 



d 2 (x„v k )= x,-v k \ =( X] -v k ) T A(x 1 -v k ), (3.4) 



9 9 

where A is a positive definite matrix. If A = I pxp then d (x,y) = ||x - y|| . 



Theorem 3.1 (prototypes oj FCM): If m and c are fixed parameters, and I, I are sets 
defined as: 
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Vi,l<i<n 



fl i ={k|l<k<c;d(x i ,v k ) = 0}, 



(3.5) 



_] i ={l,2,...,c}/I i 

then (U,V) e (M fcn x R cp ) may be a global minimum for J m (U,V) with respect to (w.r.t) 
the restriction 



I2>« =1 ' 



l<i<n 



k=l 



as in (3.2), only if: 



and 



V u ki = <^ 

l<i<n kl 
l<k<c 



1 



v d 2 ( X „v k ) y 
[0, k £ I, 
ll, kel ' 



m-1 / c 



1=1 



d 2 (x i5Vl ) y 



m-1 



I i= 
L*0 



v v k =z< x 7z<- 



l<k<c 



i=l 



i=l 



(3.6) 



(3.7) 



(3.8) 



Proof. We consider minimizing the J m w.r.t. U under the restriction (3.6) using 
Lagrange multiplier. Let the Lagrange multiplier be A,i, i=l,. . .,n and put 



L=II<d 2 (x l ,v k )-2> 1 £> kl -i). 



i=l k=l 



i=l k=l 



As the necessary optimal condition, 



8L 
du Vi 



= m(ur)d 2 (x i ,v k )-X 1 =0, 



(3.9) 
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dL 



dv 



= 22uS|x I -v k | = 0. 



(3.10) 



k i=l 



Assume that k £ l{ then d (xi, Vk) * 0. We derive from the Equation (3.9) for each k: 



i 



i 



K 



md 2 (x ; ,v k ) y 



m_i i% i m_i 

K m) 



V d2 ( x .' v k)y 



m-l 



l<k<c. (3.11) 



Summing up for k=l,. . .,c and taking (3.6) into account, we have, 



X : V" 1 ' 



z«h- z 

k=l V m / k=l 



1 



v d (x.,v k ), 



= 1. 



Hence, 



00 



m 



m-l 



Viny 



z 

k=l 



1 



d 2 (x ; ,v k ) 



m-l 



Together with (3. 1 1), we obtain the following for the membership, 



u ki = 



1 



^ 



1=1 



V-i v 



d 2 ( Xi ,v k ) 



m-l 



v d 2 ( Xl , Vl ) y 



Thus, 



u kl = 



v d2 ( x .> v k)y 



m-l / c 



1=1 



^m-1 



v d 2 (x 1 ,v 1 ) y 



Similarly, we derive from (3.10) for each k, l<k<c: 



v k =5>s*,/5>s 



i=l 



i=l 



Q.E.D 
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3.2 FCM convergence 

The iteration process of the FCM algorithm can be described using a map: 

3 m : (U, V) -> (U, V) where U = F(V) and V = G(U) . 

(U w , V w ), t=l,2... is called an iteration sequence of the FCM algorithm. If 

(U (t) ,V (t) ) = 3 m (U (t - l \V (t - l) ), t>l, where (U (0) ,V (0) ) is any element of M fcn x R cp . 



Set 



a = I(u\ y% m- x R - J >\V)^ J >, v> ) vu - m-,u * v'M 

1 J,(U.V)<J„(U*,V) W€R',V*V* 



Theorem 3.2 (descent function - Bezdek) [133]: J m is a descent function for {3 m , Q} 
Proof: Because {y -^ d 2 (y) } and {y -> y m } are continuous, J m is the sum of products of 
such functions, J m is continuous on {M fcn , R cp }. 

J m (3 m (U,V)) = J m (F(G(U)), G(U)) 

< J m (U,G(U)) by the first case in the definition of Q 

< J m (U, V) by the second case in the definition of Q. 

Thus, J m is a descent function for {3 m , Q}. Q.E.D 

Theorem 3.3 (solution convex set): Let [conv(X)] c be the c-fold Cartesian product of the 
convex hull of X, and let (U°, V°) be the starting point of the sequence of 3 m iteration, 

U° e M fcn and V° = G(U°). Then 3 m (U (t " 1} ,V (t " 1} ) e M fcn x [conv(X)] c , 1=1,2. . . 

M fcn x[conv(X)] c is compact in M fcn xR cp . 

Proof Let U° e M fcn be chosen. For each k, 1 < k < c, we have 
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i=l / i=l 



Let p kl = (KiT/ll( u liT > we have v k ^PA , and £p ki = 1 



i=l i=l i=l 



Thus v° k g conv(X) , and therefore V° e [conv(X)] c . 

Subsequently, we can prove that U (t) g M fcn , V (t) g [conv(X)] c , Vt > 1. Q.E.D 

Theorem 3.4 (the convergence of FCM - Bezdek et al.) [134]: Let 0° = (U°, V°) be the 
starting point of the iterations with 3 m . The sequence 3^(U 5 V°) 5 t = 1,2... either 
terminates at an optimal point = (U , V ) g Q, or there is sub-sequence converging to 
a point in Q. 

Proof: 3 m is a continuous and descent function (Theorem 3.2) and the iteration 
sequences are always in a compact subset of the domain of J m (Theorem 3.3). Thus, 3 m 
should terminate at an optimal point = (U ,V ) or there is a sub-sequence converging 
to a point in Q. Q.E.D 

3.3 Distance measures 

Similar to other clustering algorithms, FCM uses the distance or dissimilarity of 
data points, each described by a set of attributes and denoted as a multidimensional 
vector. The attributes can be quantitative or qualitative, continuous or discrete, which 
leads to different measurement mechanisms. Accordingly, for a dataset in the form of 
an object data matrix, the data matrix is designated as two-mode, because its row and 
column indices have different meanings. This is in contrast to the one-mode distance 
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matrix, which is symmetric with elements representing the distance or dissimilarity 
measure for any pair of data points, because both dimensions share the same meaning. 

Given a dataset X, the distance function d(.) on X is defined to satisfy the 
following conditions, 

Symmetric: d(x,y) = d(y,x), 

Positive: d(x,y) > 0, V x,y eX. 
If the following conditions hold, then d is a metric. 

Triangle inequality: d(x,z) < d(x,y) + d(y,z), Vx,y,zeX, 

Reflexive: d(x,y) = iif x = y. 
If the triangle inequality is violated, then d is a semi-metric. 
Many different distance measures have been used in cluster analysis. 

Euclidean distance 

The Euclidean distance is the most widely used distance measure and defined as 

d 2 (x,y) = X(x 1 -y 1 ) 2 . (3.12) 

i=l 

This distance is a true metric because it satisfies the triangle inequality. In the 
Equation (3.12), the expression data Xi and yi are subtracted directly from each other. 
We, therefore, need to ensure that the expression data are properly normalized when 
using the Euclidean distance, for example by converting the measured gene expression 
levels to log-ratios. 
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Pearson correlation distance 



The Pearson correlation coefficient distance is based on the Pearson correlation 



coefficient (PCC), which describes the similarity of objects as 



PCC(x ? y)^f (X --" l)(y --^ ) 

P !=1 °x°y 



where a x and a y are the sample standard deviation of x and y respectively. PCC has a 
value from -1 to 1, where PCC=1 when x and y are identical, PCC=0 when they are 
unrelated, and PCC = -1 when they are anti-correlated. The Pearson's correlation 
distance is then defined as 



d (x,y) = l-PCC(x,y). 



(3.13) 



Euclidean 



Correlation 




Figure 3-1 : Expression levels of three genes in five experimental conditions 
The value of d p lies in [0,2] where d p =0 implies that x and y are identical, and 
d p =2 implies that they are very different from each other. While the Euclidean measure 
takes the magnitude of the data into account, the Pearson correlation measure 
geometrically captures the pattern of the two data points [1]. It is therefore widely used 
for gene expression data. Even if the expression levels of the two genes are different, if 
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they peak similarly in the experiments, then they have a high correlation coefficient, or 
a small Pearson correlation distance (Figure 3-1). Because of its definition, the Pearson 
correlation distance satisfies the distance conditions, but it is not a metric since it 
violates the triangle inequality. 

Absolute Pearson correlation distance 

The distance is defined as, 

d ap (x,y) = l-|PCC(x,y)|. (3.14) 

Because the absolute value of PCC falls in the range [0,1], the d ap distance also 
falls between [0,1]. d ap is equal to 1 if the expression levels of the two genes have the 
same shape, i.e., either exactly the same or exactly opposite. Therefore, d ap (3.14) 
should be used with care. 

Uncentered Pearson correlation distance 

The distance is based on the uncentered Pearson correlation coefficient (UPCC) 
which is defined as 



UPCC(x,y) = ±£^ 

P i=l °x ^y 



where o° z = ^ zf /p . The uncentered Pearson correlation distance is defined as 



i=i 



d up (x,y) = l-UPCC(x,y). (3.15) 

The d up distance in Equation (3.15) may be appropriate if there is a zero 
reference state. For instance, in the case of gene expression data given in terms of log- 
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ratios, a log-ratio equal to corresponds to green and red signal being equal, which 
means that the experimental manipulation did not affect the expression. Because UPCC 
lies in the range [-1,1], the d up distance falls between [0,2]. 

Absolute uncentered Pearson correlation distance 

This distance is defined as in (3.16). Because the absolute value of UPCC falls 
between [0,1], the d aup distance also falls between [0,1]. 

d aup (x,y) = l-|UPCC(x,y)|. (3.16) 

Comparison of measures 

The distance measure plays an important role in obtaining correct clusters. For 
simple datasets, where the data is multidimensional, the Euclidean distance measure is 
employed. But as the dimensions of the dataset increase, where each dimension denotes 
a specific aspect of the dataset, the Euclidean distance measure may not be the best one 
to be used. 

For the Iris dataset, we ran the FCM algorithm with the number of clusters set to 
3, which is the number of classes of Iris flowers: Setosa, Versicolor, and Virginica in 
the dataset. By using five different distance measures from (3.12) to (3.16), we obtained 
five different cluster partitions. For each partition, we compared the cluster label of 
each data point with its class label to compute the correctness ratio for every class. The 
results are shown in Table 3.1. While the four distance measures using Pearson's 
correlation coefficient provide correct results, the Euclidean distance measure performs 
worse, particularly on the Virginica class. In this comparison, it is clear that the absolute 
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Pearson correlation measure and the Pearson correlation measure results are identical. 
Similarly, the result of the uncentered Pearson correlation measure is identical with that 
of the absolute uncentered Pearson correlation measure. 

Table 3.1 : Performance of different distance measures on the Iris dataset 



Distance measure method 


Classification correctness (%) 


Average 
Correctness 


Setosa 


Versicolor 


Virginica 


Euclidean 


100.00 


94.00 


76.00 


90.00 


Pearson correlation 


100.00 


94.00 


94.00 


96.00 


Abs. Pearson correlation 


100.00 


94.00 


94.00 


96.00 


Uncentered Pearson cor. 


100.00 


92.00 


100.00 


97.33 


Absolute uncentered Pearson cor. 


100.00 


92.00 


100.00 


97.33 



Table 3.2 : Performance of different distance measures on the Wine dataset 



Distance measure method 


Classification correctness (%) 


Average 
Correctness 


#1 


#2 


#3 


Euclidean 


76.27 


70.42 


56.25 


67.65 


Pearson correlation 


91.53 


85.92 


100.00 


92.48 


Abs. Pearson correlation 


72.88 


67.61 


95.83 


78.77 


Uncentered Pearson cor. 


91.53 


85.92 


97.92 


91.79 


Absolute uncentered Pearson cor. 


91.53 


85.92 


97.92 


91.79 



We ran a similar benchmark on the Wine dataset which contains data objects of 
the three different classes of wines. Results are shown in Table 3.2. Again, all Pearson 
correlation based distance measures performed better than the Euclidean distance 
measure. The Pearson correlation measures seem to be better for distance measure in 
real and unsealed datasets. 
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3.4 Partially missing data analysis 

In cluster analysis, complete information is preferred throughout the experiment. 
Unfortunately, real world datasets frequently have missing values. This can be caused 
by errors that lead to incomplete attributes or by random noise. For example, sensor 
failures in a control system may cause the system to miss information. For gene 
expression data, the missing values can come from the platform level and meta-analysis 
level. With the former, the reasons include insufficient resolution, image corruption, 
spotting, scratches or dust on the slide, or hybridization failure. The latter can be caused 
by differences in platforms or chip generations, and the lack of an appropriate method 
to map probes onto genes across different microarray platforms. 

In a dataset with partially missing data, some attribute values of a data point xi 
may be not observed. For example, xi = (xn, Xi 2 , ?, ?, Xi 5 , ?) has missing values 
corresponding to the third, fourth and sixth attributes and only the first, second and fifth 
attributes are observed. This causes a problem with distance measurement between xi 
and other data points in the dataset and the cluster algorithm, therefore, cannot perform 
properly at xi. Let 

Xw = { xi g X| xi is a complete data point }, 

X P = { xij | Xi^ ? }, and 

X M = { Xij | Xij = ? } . 

Clearly, X P and X contain more information than X w [140]. 
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Three approaches [137-139] have been widely used to address the problem of 
missing values. (1) The ignorance-based approach is the most trivial approach to deal 
with datasets when the proportion of incomplete data is small, |X / X w | (( |X|, but the 
elimination brings a loss of information. (2) The Model-based approach defines a model 
for the partially missing data, X M , using available information, X P , and applies the 
model to build a complete dataset. However, the complexity of this method can prevent 
application to large datasets. (3) The Imputation-based approach supplies missing 
values X M by certain means of approximation. Of the three, the Imputation-based 
approach is preferred. Because of the need to use an approximation model for the 
missing data, it is therefore integrated with the cluster algorithm and works based on the 
cluster model, which may be a probability-based or possibility-based one, estimated by 
the cluster algorithm. 

For the Imputation-based approach using FCM, Hathaway et al. [140] identified 
four strategies to solve the problem of missing values. 

Whole data strategy (WDS) 

This strategy, similar to the ignorance-based approach, eliminates the missing 
data. Results therefore present the cluster partition of X w instead of X. This strategy is 
applicable when |X P | / |X| > 0.75. 

Partial distance strategy (PDS) 

This is the strategy recommended when X M is large. The distance between every 
pair of data points is computed using X P . 
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d^(x i ,v k ) = |x i -v k | AIi =(x i -v k ) 1 (AI i )(x i -v k ), (3.17) 

where Ii is the index function of the data point xi, defined as in (3.18). 

fO, x, eX M 
h=\: " V M forl<i<n,l<t<p. (3.18) 

[1, x it eX p 

Optimal completion strategy (OCS) 

In addition to model parameters, this strategy estimates X M . The FCM algorithm 
is modified to meet this method: in the first iteration, X w is used instead of X, and X M is 
estimated in the favor of optimizing the J m function using (3.19). FCM is then run using 
the entire dataset X until convergence. 



I< 



; XV,, 



*,=—< ,x lje X M . (3-19) 

k=l 

Nearest prototype strategy (NPS) 

The nearest prototype strategy is in fact a version of the k-nearest neighbors 
strategy [89] for fuzzy partition. This is also considered the simple version of OCS. In 
each iteration, X M is updated as, 

x u =v kj , where x {j eX M ,dJ(x i ,v) = mjdJ(x i ,v 1 )). (3.20) 

Algorithm performance measures 

Hathaway et al. [140] proposed two measures for performance evaluation of 
imputation algorithms. The misclassification measure (MISC) is computed based on the 
assigned cluster label and the actual class label of the data objects, defined as, 
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MISC(X,V) = £l m (C(x i ),V(x i )), (3.21) 



i=l 



where C(xi) and V(xi) are the actual class and the assigned cluster labels of the data 
object xi respectively. I m is the index function, defined as, 

fl, x^y 

i m (x,y)= : y . 

[0, x = y 

The cluster prototype error (PERR) measures the difference between the predicted and 
the actual cluster prototypes of the dataset; it is defined as, 



PERR(V) = 



r c A 

Z d ( v k> v I) 



7 2 



(3.22) 



Vk=l 

where Vk, the k th predicted cluster prototype, and v^ , the k th actual cluster prototype, 
k=l,...,c. We applied the four strategies discussed above to the Iris dataset [140] using 
different scales of the missing values, X M . The results are shown in Table 3.3. 

The dataset contains 150x4 = 6000 attribute values. X M was generated randomly 
at various scales from to 0.75. Each method was run 50 times. In Table 3.3, columns 
2-5 show the average number of iterations to convergence, columns 3-6 show the 
misclassification errors, and columns 7-10 show the mean prototype errors, computed 
based on the distances between the estimated prototypes and the real prototypes of the 
dataset for the four methods. 

Based on the errors in columns 7-10, the accuracy of prototype estimation by 
OCS was always at least as good as that of other approaches. The WDS performed well 
for datasets with low percentages of missing values. The second best method was NPS. 
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Table 3.3 : Average results of 50 trials using incomplete Iris data [140] 



% 

Missing 


Mean Number of Iterati 
to Termination 


ons 


Mean Number of Misclassifications 
(Training Error) 


Mean Prototype Error 


WDS 


PDS 


OCS 


NPS 


WDS 


PDS 


OCS 


NPS 


WDS 


PDS 


OCS 


NPS 





I 


1 


1 


1 


16 


16 


16 


16 


0.00 


0.00 


0.00 


0.00 


10 


18.18 


15.64 


22.20 


16.00 


16.04 


16,24 


16.28 


16.28 


0.17 


0.07 


0.07 


0.07 


20 


20.74 


17,24 


22,64 


18.12 


16.42 


16.62 


16.48 


16.54 


0J3 


0.12 


0.11 


0.12 


30 


22.68 


17.84 


29.90 


20.42 


19,00 


17.04 


16.72 


16.98 


0.49 


0.15 


0.14 


0.15 


40 


23,62 


18.48 


29.02 


22.40 


22.46 


18.76 


18.82 


18.84 


0,78 


0.18 


0.17 


0.18 


50 


14,80 


20.14 


33-06 


29.08 


42.64 


2L44 


21,14 


21.36 


2.11 


0-25 


0.21 


0-24 


15 


N/A 


47,20 


68,92 


7078 


N/A 


33.44 


29.12 


30.54 


N/A 


0.74 


0.39 


0,52 



3.5 FCM Clustering solution validation 

A fundamental and the most difficult problem in cluster analysis, cluster 
validation, is to determine whether the fuzzy partition is of good quality. Cluster 
validation addresses the following questions: 

i. If all existing algorithms are applied to the dataset, and one obtains a multitude 
of different partitions, which assignment is correct? 

ii. If the number of clusters is unknown, partitions can be determined for different 
numbers of clusters. Which partition (or number of clusters) is correct? 

iii. Most algorithms assume a certain structure to determine a partition, without 

testing if this structure really exists in the data. Does the result of a certain 

clustering technique justify its application? 

The cluster validation problem is the general problem of determining whether the 

underlying assumptions (cluster shapes, number of clusters...) of a clustering algorithm 

are satisfied for the dataset. There are three different approaches for cluster validation: 
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a) Using hypothesis testing on the clusters, 

b) Applying model selection techniques to the cluster model, 

c) Using cluster validity measures. 

In the first two methods, statistical tests with assumption on probabilistic data 
distribution model are used. Hence, cluster validation is performed based on the 
assumed distribution. While the first approach, using hypothesis testing, is applied to a 
single model that we consider the local validation approach (3.5.2), the approach using 
model selection technique, similar to the global validation approach (3.5.1), takes into 
account a family of probabilistic models according to a set of partitions from which the 
best should be selected. 

FCM is a possibility-based approach. It is similar to the EM algorithm with the 
mixture model, which is a probability-based approach. Compared with FCM, EM, with 
a standard model assumption about the data distribution, has the disadvantage that it 
converges slowly. However, EM, because of its assumed statistical model, has the 
advantage of using the first two approaches to evaluate the cluster results. In contrast to 
EM, these approaches are useless for FCM. Therefore, most methods for fuzzy partition 
validation are based on the third approach, where cluster validity measures are used. 
Lacking certain knowledge about the data, it has to be determined for each problem 
individually which cluster shape is appropriate, what distinguishes good clusters from 
bad ones, and whether there are data lacking any structure. There are two common 
approaches to create cluster validity measures. 
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3.5.1 Global validity measures 

Global validity measures are mappings g: A(D,R) — » R describing the quality of a 
complete clustering partition using a single real value. The objective function, J m in 
(3.3) is a simple example of such a validity measure. When we determine the cluster 
parameters and memberships, we minimize the J m function. However, the objective 
function is a poor choice for creating validity functions; it is obvious that lim(J m ) = 0, 

but c = n is not the optimal number of clusters for all datasets. In recent literature, a 
number of cluster validity measures have been developed and have addressed the 
problem for specific situations [92-97, 121-131]. 

Definition 3.3 (Partition coefficient - PC, Bezdek): Let Af(R p , R cp ) be an analysis space, 
X c: R p , and U: X^Vg R cp a fuzzy cluster partition. The partition coefficient PC of 
the partition U is defined by 

v pc =1 Z2X- ( 3 - 23 ) 

n i=l k=l 

The inequality — < V PC < 1 holds because of (3.2) and (3.6). Vpc = 1 for all hard 
c 

partitions. If the fuzzy partition contains no information, each object is assigned to each 

cluster with the same degree and the minimum value 1/c of Vpc is obtained. A 

maximum value of Vpc should be considered when making a selection between multiple 

fuzzy partitions, i.e., V PC (U t ) = max V PC (U k ). The optimal partition, U op t, is the most 

l<k<c 

"unambiguous" one regarding cluster assignment of the objects. 
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Definition 3.4 (partition entropy - PE, Bezdek): Let A(R P , R cp ) be an analysis space, X 
c= R P and U:X^Vg R cp a fuzzy partition. The partition entropy PE of the partition U 
is defined by 

V PE=- 1 ZX u kl x ln K)- ( 3 - 24 ) 



n 



i=l k=l 



Similar to the partition coefficient, we can show that < V PE < ln(c) holds for 
any fuzzy partition U: X— >V. When the fuzzy memberships are uniformly distributed, 
Vpe tends towards to ln(c). If U is the most ambiguous fuzzy partition, then the 
maximum value of Vpe is obtained. The entropy, i.e., the mean information content of a 
source which describes a correct partition, is 0, which is also obtained when U is a hard 
partition. To find the best fuzzy partition, we therefore look for the partition with 
minimum entropy, i.e., V PE ("U t ) = min V PE (U k ) . 



l<k<c 



Both PC and PE have the problem of over-fitting in estimating the number of 
clusters, c, because the use of the all memberships in the validity measure does not 
necessarily increase the dependence on this parameter. This is why the PC and PE 
factors tend to increase for large values of c and provide optimal decisions only on 
unambiguous partitions. For a partition to be good, its clusters should satisfy the 
compactness and separateness conditions. A partition contains compact and separate 
clusters if any two objects from the same cluster have a smaller distance than any two 
objects from different clusters. 
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Fukuyama and Sugeno [93] first proposed a simple approach to measure the 
cluster separation using the geometry of cluster centers; the more the clusters scatter 
away from the data central point, the corresponding solution is preferred. 

Definition 3.5 (Fukuyama-Sugeno - FS cluster index): Let (R P , R cp ) be an analysis 
space, X cz R p and U: X^V e R cp a fuzzy partition. The Fukuyama-Sugeno cluster 
index of the partition U is defined by 



V F s=J-XIu h ra d 2 (v k) v), (3.25) 

k=l i=l 



where v = XI-i v k/ c • ^ or eac ' 1 c l uster v k> 1^ k < c, the inequality d 2 (xi,Vk) < d 2 ( v,Vk) 

holds for every data object x^ l<i< n. We, therefore, have Vfs ^ 0. V F s = when there is 
one cluster in the fuzzy partition. The more the clusters are away from the central 
prototype, v the more separated the clusters are. To find the best fuzzy partition, we 
therefore look for partition with minimum Vfs- 

There are still the cases, however, where some clusters are far away from v but 
still close to each other. Dunn [123] proposed a cluster index which measures both the 
compactness and the separation of the partition. This cluster index, therefore, was 
named the compact and separated cluster index. However, Dunn's index is only 
applicable to hard partitions. To solve this problem, Xie and Beni [94] proposed their 
cluster index using a separation factor where the distance between clusters is also taken 
into account. 
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Definition 3.6 (Xie and Beni - XB cluster index): Let (R P , R cp ) be an analysis space, X c 
R p and U: X^V e R cp a fuzzy partition. The Xie-Beni cluster index of the partition U 
is defined by 



n c 



J, 



EKd 2 (x„v k ) 



V XB = 2 - 5- = ^^ , (3.26) 

nxSep(V) 2 nxD min (V) 2 



l<k,l<c,k*l 



where Sep(V) = D min (V) = min {d(v k , v { ) } is the minimum distance between cluster 
centers of the partition. Since, 



n c n c 



ZZ u ki d2 ( x i> v k) ZZ u ki d2 ( x i> v k) * c Z u ki d2 ( x i> v k) 

i=l k=l <> i=l k=l _ i X^ i=l 



< 



n 



=rS- 



Y.< k = l IX 

i-l i=l 

c 

^diam(v k ) 2 

< ^^ < max{diam(v k ) 2 } . 

c i<k<c' k/ 



We have V xp < — - , where DI is the Dunn compact and separated index; 



DI= D - (V) 



max{diam(v v )} 

l<k<c k 



Compared to DI, the Vxb index is a much simpler computation, and it takes the 
membership degrees into account so that a detour via a hard partition is not necessary 
[121]. However, the cluster indices, PC, PE, FS and XB do not consider the clusters 
shapes and volumes. Gath and Geva, based on the concepts of hypervolume and 
density, proposed three cluster hypervolume measures which are the basis of their 
cluster validity index and the indices of many other researchers. 
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Definition 3. 7 (fuzzy hyper volume - FHV): Let (R P , R cp ) be an analysis space, X cz R p 
and U: X— »V e R cp a fuzzy partition. The fuzzy hypervolume of the partition U is 
defined by 

k=l 

n / n 

where Fk, F k = ^u ki d 2 (x i? v k ) /^u ki is the fuzzy covariance matrix of Vk [125]. A 

i=l / i=l 

fuzzy partition can be expected to have a small value of Vfhv if the partition is compact. 
However, Vfhv measures only the compactness of the partition. Rezaee et al. [95] 
extended the Vfhv index by adding a term that measures the scattering between clusters 
in the partition. 

Definition 3.8 (Compose within & between scattering index - CWB, Rezaee et al): Let 
(R P , R cp ) be an analysis space, X cz R p and U: X— »V e R cp a fuzzy partition. The 
compose within and between scattering index [95] of the partition U is defined by 

V CWB = aScat(V) + Dis(V) . (3.28) 

Scat(V) measures the average scattering of c clusters in V and is defined as 

V. 



Scat(V) = 



'FHV 



C X CJ« 



( x )| 



1 n II — 

where ^(X) = — ^ x i - x 



n 



i=l 



— — 1 n 

and x, x = — Vxj is the center point of X. Dis(V) measures the scattering between c 
clusters of V and is defined as 
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Dis(V) = ^X|XIK- v 



mm 



k=l 



111 

V 1=1 



The first term of Vcwb, i.e., Scat() in (3.28), indicates the average of scattering 
variation within the c clusters of the partition. The smaller this term is the more compact 
the partition is. However, this term does not take into account any geometric 
assumptions of the cluster prototypes. The second term of Vcwb, Dis() indicates the 
average scattering separation between clusters. It is influenced by the geometry of the 
cluster centers and will increase if the number of clusters increases. For the best 
partition, we look for the partition with the minimum value of Vcwb- 

Definition 3.9 (Pakhira, Bandyopadhyay and Maulik fuzzy cluster index - PBMF): Let 
(R p , R cp ) be an analysis space, XcR p and U: X->V e R cp a fuzzy partition. The PBMF 
index [96] of the partition U is defined by 



v 

V PBMF 



'IE 

j max 

V c J i J 



(3.29) 



where Ei = na(X) which is a constant for each dataset. The term Ji measures the 
compactness, while the term D max measures the scattering separation of the clusters in 
the partition. A good partition must have either a small Ji or a large D max . To find the 
best partition, we therefore look for the partition with the maximum value of Vpbmf- 

Definition 3.10 (Babak Rezaee - BR cluster index): Let (R P , R cp ) be an analysis space, X 
c R P and U: X->V e R cp a fuzzy partition. The BR index [97] of the partition U is 
defined by 
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V = 

V BR 



Sep(V) 



• + ■ 



max{Sep(V)} max{J 2 } 



(3.30) 



where Sep(V) = 



c ( c 1) k,l=l,k*l 



^C REL (v k ,Vj). Instead of using geometric distance between 



the two centers as the distance of every pair of clusters, Rezaee [97] used the relative 
degree of sharing or connectivity, Crel of the two clusters, defined as in (3.31). 

To compare the performance of the global cluster validity indices from (3.23) to 
(3.30), we chose the three previously used real datasets: Iris, Wine, and Glass and the 
artificial dataset ASET2 with 200 data points in five well-separated clusters (Figure 

3-2). 
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Figure 3-2 : ASET2 dataset with five well-separated clusters 
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For each of the four selected datasets, we ran FCM with the number of clusters 
set from 2 to 10, each partition, corresponding to a cluster number, was evaluated using 
the cluster validity measures. The results are shown separately for individual validity 
measures in Figures 3-3 - 3-9. 
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Figure 3-3 : PC index (maximize) Figure 3-4 : PE index (minimize) 

The two indices, PC in (3.23) and PE in (3.24) behave very similarly; when PC 
decreases (Figure 3-3), then PE (Figure 3-4) increases and vice versa. These two 
measures, however, prefer partitions less ambiguous; they correctly identified the 
number of clusters in the artificial dataset, ASET2, but on the real datasets, where the 
clusters significantly overlap, they failed. 
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Figure 3-5 : FS index (minimize) 



Figure 3-6 : XB index (minimize) 
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Figure 3-7 : CWB index (minimize) 



Figure 3-8 : PBMF index (maximize) 
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Figure 3-9 : BR index (minimize) 
The other validity measures also correctly identified the cluster number in the 
ASET2 dataset. However, with the exception of PBMF with the Iris dataset (Figure 
3-8), all failed to detect the correct cluster number in the real datasets. 

For further comparisons between these validity measures, we used 84 artificial 
datasets [147] generated using a finite mixture model as in Section 2.5.1. Datasets are 
distinguished by the dimensions and cluster number, and we generated (3-2+l)*(9- 
3+l)=14 dataset types. For each type, we generated 6 datasets, for a total of 6*14=84. 
For each artificial dataset, we ran the standard FCM five times with m set to 2.0. In each 
case, the best fuzzy partition was then selected to run the validity measures to search for 
the optimal number of clusters between 2-12 and to compare with the known number of 
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clusters. We repeated the experiment 20 times and averaged the performance of each 
method. Results of these comparisons are shown in Figure 3-10. 
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Figure 3-10 : Performance of the global validity measures on artificial datasets with 

different numbers of clusters 

The PC measure provides better results than the PE measure. Measures using the 
geometry of the prototypes and the geometric distance measures i.e., PBMF, CWB and 
FS, have much better performance than those of validity measures not using prototype 
geometry. However, using geometric information for measuring the partition 
compactness does not help validity measure with better performance. The cluster 
compactness should be measured using the memberships. This is reasonable because the 
data do not always distribute equally, and the cluster volumes, therefore, may differ 
from cluster to cluster; it is not necessary that any two data points having the same 
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memberships should have the same distance to the centers of clusters to which they 
belong. The selection of validity measures therefore should be specific to the individual 
application. 

3.5.2 Local validity measures 

Determination of the optimum cluster number using global validity measures is 
very expensive, because clustering must be carried out for multiple cluster numbers. If a 
perfect partition cannot be recognized by a single run, we could at least filter the 
presumably correctly recognized clusters [121]. A validity measure, called a local 
validity measure is, therefore, necessary to evaluate clusters individually. This validity 
measure helps to correct the partition more directly. The concept of local validity 
measures is that, by running FCM with a possibly maximized cluster number [122] and 
analyzing the resulting partition, we provide an improved initialization for the next run. 
We continue this procedure until finally the whole partition consists only of good 
clusters. If the final partition with only good clusters is found, we have determined the 
optimal number of clusters and the best assignment of data points to clusters. Hence, in 
addition to addressing the problem of cluster validation, the local validity index helps 
with the problem of parameters initialization. 

Definition 3.11 (Relative degree of connectivity of two clusters - Crel)' Let (R , R cp ) be 
an analysis space, XcR p and U: X^V e R cp a fuzzy partition. The relative degree of 
connectivity of the two clusters Vk, vi e V is defined as [97, 127] 

n 

C REL (v k ,v 1 ) = J]min{u ki ,u li }xe i , (3.31) 

i=l 
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where q { =-^u ti ln(u ti ). ei is the entropy of the c-partition at xi g X and has a 



t=i 



maximum value of ln(c) when Xi has the same membership degree to all clusters. Hence, 

we have 0<C REL (v k ,v 1 ) < n n ^ c y . The two clusters Vk, vi are well separated as 

Crel^O. 

Definition 3.12 (Inter-cluster connectivity - Cic): Let (R p , R cp ) be an analysis space, X cz 
R P and U: X^V e R cp a fuzzy partition. The inter-cluster connectivity [129] of the two 
clusters Vk, vi e V is defined as 

C lc (v k ,v 1 ) = |RF kl | + |RF lk |, (3.32) 

where RF kl = { Xi g X | u kl = max{u tl },u H = max{{u tl } Kt , c \ {u kl }} ). 



The Cic(vk,vi) shows the overlap region between clusters Vk and vi. If the two 
clusters are well separated then Cic(vk,vi) — » 0. A greater Cic is an indication of a 
greater degree of the similarity between Vk and vi. If Cic(vk, vi) is larger than a selected 
threshold, say 0.5, then the two clusters Vk and vineed to be combined. 

Definition 3.13 (fuzzy inclusion similarity - Sfi)'- Let (R , R cp ) be an analysis space, X c: 
R P and U: X^V e R cp a fuzzy partition. The fuzzy inclusion similarity [130] of the two 
clusters Vk, vi e V is defined as 



Z min K> u i 

i=l 

f n n 



S FI (v k ,v,) = ^ ir ^ r . (3.33) 



. i=l i=l 
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Definition 3.14 (link-based similarity - Slb)'- Let (R P , R cp ) be an analysis space, XcR p 
and U: X^V e R cp a fuzzy partition. The link-based similarity [131] of the two clusters 
Vk, vi g V is defined as 

c 
E S Mir( V k, V l) 

SlbK,v 1 ) = ^— ^ , (3.34) 



where 



S!„ dir (v k ,v 1 ) = min{S Dir (v k ,v t ),S Dir (v t ,v 1 )}, 
S^=max{S LB (v k , Vl )}, 



l<k,l<c 



and 



%ir(^^l) = 



| V k nV l| 

kuvJ 



Because Vk and vi are fuzzy sets, according to the fuzzy set theory, we have 

n 

^min{u ki ,u H } 

S Dir (v k ,v 1 ) = ^ . 



Z max Ki> u ii} 

i=l 

The cluster merging technique evaluates clusters for their compatibility 
(similarity) using any of the local validity measures discussed above and merges 
clusters that are found to be compatible. In every step of this merging procedure, the 
most similar pair of adjacent clusters is merged if their compatibility level is above a 
user-defined threshold y, and the number of clusters is therefore gradually reduced. 
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Either the compatible cluster merging (CCM) algorithm [126, 127, 128] or the 
agglomerative HC algorithm [132] can be used for this process. 

In the case of Iris dataset, we first ran FCM with the number of clusters set to 
12, which is equal to LVnJ, where n=150, and is clearly greater than the optimal number 
of clusters. The HC algorithm was applied with the weighted linkage used as a local 
validity measure to merge adjacent compatible clusters. In Figure 3-11, it can be seen 
that most of the clusters were merged into two large groups and the third group contains 
only one cluster. 
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Figure 3-11: Dendrogram of the Iris dataset from a 12-partition generated by FCM 
For the Wine dataset, we ran FCM with the number of clusters set to 13, which 
is equal to LVnJ, where n=178, and is larger than the optimal number of clusters. The 
HC algorithm was applied with the weighted linkage. It can be seen in Figure 3-12 that 
the clusters gather into three groups. 
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Figure 3-12: Dendrogram of the Wine dataset from a 13 -partition generated by FCM 

By considering only the highest levels of the dendrograms as in Figure 3-11 and 
Figure 3-12, the number of clusters in the Iris and Wine datasets can be identified 
correctly. However, in the fuzzy cluster analysis, several prototypes can share one 
cluster. A local validity measure based on memberships can only identify a prototype 
pair instead of all of these prototypes. For example, in the case of the Iris dataset, the 
clusters 2, 3, 7, 8 and 12 share the class Virginica, but only cluster 12 was separated to 
form a final cluster. For the Wine dataset, the clusters 1, 2, 3, 4, 6, 9 and 10 share the 
overlap region between classes 2 and 3 but only the clusters 9 and 10 are grouped to 
form a separate cluster. To solve this problem, the best cluster is removed after each 
merging step and the distances among the rest are recomputed [126-132]. This 
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procedure is repeated until no further good clusters are found or the compatible level is 
under the threshold value y [128]. 

3.6 FCM partition matrix initialization 

Theorem 3.4 shows that FCM algorithm is very sensitive to the initial value, U°, 
of the partition matrix. The determination of U°, or the initialization of the partition 
matrix, can help the algorithm to converge quickly to the global optimum or get stuck in 
local optima. A common approach in most partitioning clustering methods is to 
generate the partition matrix randomly. However, this requires different mechanisms for 
different datasets because of differing data distributions. 

120 



100 —-;-+-+!■ 



OU 



60 



40 



I 
+-+-+* i - - - +. +- -+X 

* + * * ■ * *!****** 

5 * 4 ; 


♦ ♦♦J- 

&~ + 

* \ 2 


















+ t 

6 + y 




















i 




* 


3 * 

+ 


+ 
**+ * + 



20 



40 



60 



sc 



100 



-2: 



Figure 3-13 : ASET1 dataset with 6 clusters 
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We used the artificial dataset 1 (ASET1) as an example. This dataset contains 171 
data points in six clusters (Figure 3-13) generated using a finite mixture model as 
described in Section 2.5.1. We ran the three standard clustering algorithms: k-means, k- 
medians and FCM with the number of clusters set to 6, and the partition matrix 
generated using a non-uniform random method. The cluster partition found by each 
algorithm was compared with the classification structure of the dataset. We repeated the 
experiment 20 times and averaged the performance of each algorithm, as shown in 
Table 3.4. Because of the random value of the U°, these algorithms could not always 
converge to the local optimum. Hence, the correctness ratios for every class and the 
whole dataset are sometimes off. 

Table 3.4 : Performance of three standard algorithms on ASET1 



Algorithm 


Correctness ratio by class 


Average Ratio 


1 


2 


3 


4 


5 


6 


k-means 


0.97 


0.87 


1.00 


1.00 


1.00 


0.75 


0.93 


k-medians 


0.95 


0.82 


1.00 


1.00 


1.00 


0.62 


0.90 


FCM 


0.97 


1.00 


0.95 


1.00 


1.00 


0.96 


0.98 



An alternative approach is to search for the best location of the cluster candidate 
using data points in the dataset. An integration of FCM with optimization algorithms 
such as GA and PSO is preferred. To manage a solution set, of which each solution is a 
set of data points selected as cluster candidates, the optimization algorithms try different 
values of the partition matrix. In this way, they solve the problem similarly to the 
method of initializing the matrix randomly, as we mentioned previously, which may be 
appropriate for some datasets, but not for others. 
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A simple and yet effective method for the problem of initializing the partition 
matrix is subtractive clustering (SC). The subtractive algorithm, originally developed by 
Chili [63], is a notable improvement to FCM, in which an improved mountain method 
was proposed to determine the number of clusters and estimate their initial values. 
Instead of generating the cluster candidates randomly, the SC method searches for the 
most dense objects in the dataset. Li et al. [64], Liu et al. [65], and Yang et al. [67] used 
SC to locate the most dense data points as cluster candidates. The partition matrix is 
then computed using the selected data points as cluster centers. However, the SC 
method requires the mountain peak and mountain radii numbers to be specified a priori, 
which may differ among datasets. Yang et al. [66] addressed the problem by a method 
that automatically computes these values from the dataset. However, the data 
distribution may not be identical among clusters, and the estimated parameters may be 
appropriate to some clusters but not others. Cheng et al. [68] proposed a weighted SC 
method that makes the cluster candidate selection more flexible by allowing the cluster 
centers to be located at the means of the dense data points instead of selecting the most 
dense data points themselves. This approach can help FCM work effectively. However, 
it still requires that all the SC parameters are specified a priori. 

3.7 Determination of the number of clusters 

FCM has a disadvantage that the number of clusters must be known in advance. 
When this is not known, the results with different numbers of clusters have to be 
compared using a validity function in order to find an optimal fuzzy partition with 
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respect to the new objective function. There are two common methods to address the 
problem of number of clusters using this approach [121]. 

3.7.1 Between-cluster-partition comparison 

This method defines a global validity measure which evaluates a complete cluster 
partition. An upper bound c max of the number of clusters is estimated, and a cluster 
analysis using FCM is carried out for each number of clusters, c, in {2,...,c max }. For 
each partition solution, the validity measure generates a validity index which is then 
compared with the indices of other partition solutions to find the optimal number of 
clusters. 



0.95 




Number of clusters 

Figure 3-14 : PC index on the ASET2 dataset (maximize) 

For the case of ASET2 dataset, we used Vpc, for example, as global validity 

measure, and set c max to 14 which is LVnJ, where n=200 the size of ASET2. FCM 

algorithm was run with the number of clusters set from 2 to 14; each partition, 

corresponding to a certain number of clusters, was evaluated using Vpc- The results are 
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shown in Figure 3-14. To find the optimal number of clusters, we look for the partition 
with the maximum value of Vpc (Definition 3.3). Therefore, five is the optimal number 
of clusters, and is also equal to the number of classes in ASET2. Hence, in the case of 
ASET2, Vpc correctly detected the number of clusters. 

3.7.2 Within-cluster-partition comparison 

This method defines a local validity measure that evaluates individual clusters of 
a partition. An upper bound c max of the number of clusters is also estimated and a cluster 
analysis is carried out for c=c max . The resulting clusters are then compared using the 
validity measure. Similar clusters are merged into one and very bad clusters are 
eliminated. These operations reduce the number of clusters in the partition. Afterwards, 
a cluster analysis is carried out on the remaining clusters. This procedure is repeated 
until the analysis result no longer contains either similar clusters or very bad clusters. 

Both 3.7.1 and 3.7.2 are based on the validity measures that we discussed in 3.5. 
Hence, the problems with using cluster validity measures may cause these methods to 
remain in local optima when searching for the "best" number of clusters; they assume 
that the cluster analysis results are optimal for the given numbers of clusters. However, 
FCM does not guarantee this [132]. Hence, an optimal cluster partition found by these 
methods is possibly only a local optimum. In general [121], to evaluate a possibilistic 
cluster analysis, we must consider both the quality of the clusters and the proportion of 
the classified data. This is especially difficult with quality measures for individual 
clusters according to the method in 3.7.2 [121]. In this case, the method in 3.7.1 is 
therefore recommended. 
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3.8 Determination of the fuzzifier factor 

The performance of FCM is affected by an exponent parameter, called the 
fuzzifier factor, m. This parameter affects both the convergence rate and the cluster 
validity of the algorithm. 

m=l, FCM behaves like K-Means, say all clusters are disjoint. 

m^oo, we have 



limiu 



= lim 



= lim 



m-l 



d(x,>v k ), 



1=1 



m-l 



1=1 



^ d 2 ( Xi ,v k ) V-' 
v d 2 (x i ,v ] ) y 



.^(Xf.vOy 



= - for 1 < k < c and 1 < i < n, 

c 



lim{v k }=lim 2«Sx,/S»S 

L i=l / i=l 

n 

2>. _ 



i=l 



■ = x for 1 < k < c, 



where x is the grand mean of X. Thus, when m goes to infinity, all the data points are 
considered to belong to only one cluster. Wu [144] ran FCM on the Iris dataset with 
different values of m (m=l.l, 1.5, 2, 3, 4,. . ., 50) and the number of clusters set to 3, and 
measured the misclassification rate in each cluster partition as shown in Figure 3-15. 

As m becomes large, the FCM algorithm becomes stable and no improvement can 
be achieved, even with additional iterations. If Ue M fcn , the constraint as in (3.6) makes 
it difficult to interpret Uki as the presentative of the k th cluster at xi. This is a very 
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important problem with FCM, and any application of FCM should take the 
determination of m into account. 
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Figure 3-15 (Wu [144]) : Impact of m on the misclassification rate in Iris dataset 

To date, there are no reliable criteria for the selection of the optimal fuzzifier for a 
given set of training vectors. Instead, the best value, with respect to a given value range 
of m, is fixed through experimentation; i.e., this choice is still heuristic. To our 
knowledge, only a few methods exist to determine an optimal value of m. Romdhane et 
al. [57] proposed a method to estimate the value range that should contain the optimal 
values of m, by defining the limits of the overlap ratio between clusters. An iteration 
process is then carried out until the optimal value is reached. This approach has 
provided an easy yet powerful way to estimate the value of m. However, it has the 
drawback that its optimal solution is strongly dependent on the initial random value of 
m. Dembele et al. [60] developed a method to estimate the value of m using the number 
of clusters and the estimated membership. The key idea of this method is that as long as 
m runs toward to its upper bound value, the membership of all data points will converge 
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to 1/c, where c is the number of clusters. They estimate the upper bound value of m 
using this idea. This method does not have the problem of local optima caused by the 
use of an initially random value of m. However, it requires the number of cluster to be 
specified a priori. To address this issue, Schwammle et al. [61] proposed a method 
using cluster validity indices to estimate the value of c before determining the optimal 
value of m. Similarly, Zhao et al. [62] proposed a model-based approach for the m 
estimation using the predefined value of c, which varies from 2 to C where C is the 
possibly maximum number of clusters. The estimation methods of these two approaches 
are based on the fuzzy partition generated using the predefined value of c. This value is 
not guaranteed to be optimal, because most of the cluster validity measures are not 
appropriate to the real datasets [147]. The fuzzy partition generated by the FCM 
algorithm, using the selected number of clusters, is also not optimal because FCM may 
remain in local optima. 

3.9 Defuzzification of fuzzy partition 

Defuzzification in fuzzy cluster analysis is a procedure to convert the fuzzy 
partition matrix U into a crisp partition which is then used to determine the 
classification information of the data objects. The most popular approach for 
defuzzification of fuzzy partition is application of the maximum membership degree 
(MMD) principle [148, 152, 153]. The data object Xi is assigned to the class of Vk if and 
only if its membership degree to cluster Vk is the largest, that is 

u kl =max{u H }. (3.35) 

1=1. .c 
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This approach may be inappropriate in some applications, because FCM 
membership is computed using distance between the data object and cluster center. The 
use of membership degree can assign marginal objects of a large cluster to the 
immediately adjacent small cluster. This is illustrated in Figure 3-16, where if a data 
object is in the gray rectangle, it may be incorrectly assigned to cluster 3 instead of 
cluster 2. 
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Figure 3-16 : Artificial dataset 5 (ASET5) with three clusters of different sizes 

Genther, Runkler and Glesner [151] proposed a defuzzification method based on 
the fuzzy cluster partition (FCB), where the partition information is applied in 
membership degree computation. For each data object Xi, i=l,...,n, an estimated value, 
(Li(xi), is computed using the fuzzy partition. The set of estimated values of X, 
{n(xi)}i=i,...,n is then used to compute the estimated value of Vk, Wk, k=l,. . .,c, as: 



w k =s^^i)/Z u S 



(3.36) 



i=l 



i=l 



The distance between the data object and cluster center, d F , in FCB is computed 
using both the actual and the estimated information as: 
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d F (x, , v, ) - nd ' ( f" v '> + Pd;< f (Xi) ' W ' ) , (3.37) 

\ max{d (x-,v } max{d (^Xj^Wj)} 

V j,i j,i 

where d?() is used, instead of d(), to compute the membership degree. The two 
parameters, a and (3, weight the distances from X and its estimated to cluster prototypes. 

Chuang et al. [150], in the Neighborhood Based Method (NBM), on the other hand 
proposed adjusting the membership status of every data point using the membership 
status of its neighbors. Given a cluster Vk, k=l,...,c, a data point, xi, i=l,...,n, receives 
an additional amount of membership degree from its neighbors: 

K = 2X- ( 3 - 38 ) 

XjeNBCxi) 

This additional information is used in computing the membership degree, as in 
(3.39), so that the membership level of each data point is strongly associated with those 
of its neighbors that will help with determination of classification information. Here, 

U NB = M*(KY (339) 

Z(uH) P x(h H ) q 

1=1 
Similar to FCB, the NBM method requires a definition of neighborhood boundaries 
and the values of p and q which weight the position and spatial information. Chiu et al. 
[149] also used the spatial information to compute the distance between data point and 
cluster center in their sFCM (Spatial FCM) method, as in (3.40), 

i nb r n 

d s ( Xi ,v k ) = — £[(l-a i j )d(x i ,v k ) + oid(x j ,v k )], (3.40) 
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where NB is the number of neighbors of every data point, a- is the weighted factor of 
Xj, a neighbor of xi, defined as: 

aj = l/{l + ex V (A)/o i )), (3.41) 

where A^ = d(x j ,x i ). sFCM uses ds(), instead of d(), to compute the membership 
degree, and updates cluster centers, { v k }, k=l,. . .,c, as follows: 



v^ 



=2>£ 



1 NB 

YO-oOXj +a i i x i 



K . (3.42) 



Algorithm performance measures 

The overall performance of defuzzification methods can be measured based on the 
number of data objects that are misclassified, defined as in (3.21); the cluster label of 
each data object is compared with its actual class label. If any do not match, then a 
misclassification has occurred. The compactness measure, defined as in (3.43), can be 
used to validate how the actual classification structure retains as the missing data are 
updated, 

c N k 



compactness = ^^d(x i9 v k ). (3.43) 



k=l i=l 



We tested the four defuzzification algorithms: MMD, FCB, NBM and sFCM using 
the two datasets ASET4 and ASET5 with various values of the fuzzifier factor, m. Each 
algorithm was run 100 times and the results were averaged. The number of neighbors 
was set to 8. For the ASET4 dataset which contains 300 data objects of 11 well- 
separated clusters in a 5-demensional data space, the result is shown in Table 3.5. 
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Table 3.5 : Algorithm performance on ASET4 using MISC measure 





1.25 


1.375 


1.5 


1.625 


1.75 


1.875 


2 


MMD 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


FCB 


12.24 


12.24 


12.24 


12.24 


11.72 


11.71 


11.97 


NBM 


5.89 


5.56 


1.97 


1.97 


1.94 


2.03 


1.80 


sFCM 


12.23 


9.60 


0.30 


0.75 


0.84 


0.84 


0.84 



In the Table 3.5, only the MMD method correctly detected the classification 
information across various values of m. The other methods performed worse, 
particularly the methods that used spatial information to revise the distance between 
data object and cluster center, FCB and sFCM. That may be because the assumption of 
8 as the maximum number of neighbors for each data point, which is correct in spatial 
space, is not appropriate for generic data space. 

We ran the same benchmark on the ASET5 that contains 150 data points of 3 
clusters different in size, see Figure 3-16. The result is shown in Table 3.6. 

Table 3.6 : Algorithm performance on ASET5 using MISC measure 



algo^"\ 


1.25 


1.375 


1.5 


1.625 


1.75 


1.875 


2 


MMD 


16.00 


16.00 


16.00 


16.00 


16.00 


16.00 


16.00 


FCB 


4.32 


4.32 


4.59 


4.59 


4.32 


4.32 


4.32 


NBM 


27.19 


27.19 


27.19 


27.19 


27.19 


27.19 


27.19 


sFCM 


17.00 


17.00 


17.00 


17.00 


16.00 


16.00 


16.00 



The FCB algorithm performed better than the other methods. The methods that use 
spatial information in computing distance between data object and cluster center, e.g., 
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FCB and sFCM provided better results. However, these algorithms failed at producing 
classification information for this dataset. 

The test results have shown that it is not appropriate to apply the spatial 
information into the defuzzification problem in cluster analysis of generic data, because 
properly determining the number of neighbors for each data point is complicated, and 
the choice of the two parameters, a and (3, in these methods may not be easy due to the 
data distribution difference among different datasets. 
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4. Methods 

4. 1 Microarray meta-analysis 

4.1.1 Selection of samples and features 

Key challenges in meta-analysis methods arise from the small sample sizes 

typical of microarray studies and variations due to differences in experimental 
protocols. Both affect the results from individual studies and inevitably degrade the 
final results of meta-analysis. In addition, the main outcome of these approaches is the 
ranking of genes for significance, however, it is still unknown which feature, gene 
expression value, statistic test p-value, likelihood measurement, or any other statistical 
measures, best represents of outcome measures. This chapter addresses the problem of 
choosing samples from multiple studies with the goal of reducing the variance and 
discovering more biological meaning. In addition, because microarray data are specific 
to the biological context of the experiment, in the solution proposed, the meta-analysis 
will reflect the appropriate biological context. 

4.1.2 Sample and feature selection method 

Common approaches for sample and feature selection are PCA, hierarchical 

clustering analysis, optimization algorithms and ranking methods. The first three 
approaches are limited because their results are strongly affected by the lab in which the 
experiments were carried out [25, 49]. These approaches are also computationally 
complex, because they use pair- wise sample distance measurements. When the number 
of probes or features is large, this process becomes time consuming, particularly with 
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the use of external metrics. Lastly, these methods require the data to be obtained using 
the same microarray platform. 

The alternative approach is the ranking method. Because ranking itself is 
invariant to within-array monotonic data transformations, by applying the rank-based 
TSPG (Top-Scoring Pair of Groups) method, which performs on-chip comparisons 
within each microarray, no data normalization or transformations are required before 
integration [169]. 

Feature ID mappings / Selection 

For microarray data obtained using the same generation of the same platform, 
we directly merge individual data sets using probe sets common to all experiments. 

To merge data from different generations of the same microarray platform, we 
first use the Affymetrix probeset ID mapping to annotate all probesets. Probes that are 
not annotated will be processed using EntrezGene with UniGene data. Probes having 
two or more gene IDs will have their data replicated, one for each unique gene. 

To merge data from different platforms, we first map all probe identifiers to 
gene identifiers. Probes having two or more gene IDs will be preprocessed as above. 
This method avoids excluding probes from further analysis. In this study, we focus on 
integrating microarray data from the CMAP datasets which were generated using the 
same platform but different generations. Specifically, we integrate microarray data 
generated from the Affymetrix HGU-133x and HG-U95x platforms to obtain data sets 
with large (>500) sample sizes. This integrated data set is used to discover common 
gene expression signatures in response to drug treatments [170]. 
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Sample selection 

We developed RPPR algorithm, a combination of the rank product (RP) 
statistics method with a modified version of the PageRank™ (PR) algorithm and the 
goodness-of-fit Kolmogorov-Smirnov statistical test method. The RP method generates 
the gene rank for gene expression profiles. The PR algorithm is then applied to augment 
the gene ranks using Protein-Protein interaction and GO Biological Process annotation 
data. The KS statistical test method is used to filter the expression profiles of which the 
gene ranks best describe a given gene expression signature. 

Given a sample expression profile X and a gene expression signature S, denote 
S u and Sd as the lists of up-regulated and down-regulated genes, respectively. The 
connectivity score between X and S is computed using the Kolmogorov-Smirnov scores 
of goodness-of-fit (KS) of the two sub lists in S to the profile X. The KS score of a tag 
list s to X is calculated as follows: 

Sort X based on the appropriate expression status of genes. 

Denote by t and n the lengths of s and X, respectively. Construct a vector O of 
the position (1,. . .,n) of each probe set of s in X and sort these components in ascending 
order such that 0(j) is the position of tag j, where j = 1, 2,. . ., t. Compute the following: 



a = max 

l<i<t 



O(i) 



n 



b = max 

l<i<t 



O(i) (i-1) 



n 



t 



(4.1) 



next 



KS(s,X) = 



a, a >b 

-b, a<b' 



(4.2) 
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For each expression profile associated with a specific drug, X p , p=l,...,P, where 
P is the number of profiles, let 

KS^ p =KS(S u ,X p ), 

KS p own =KS(S d ,X p ). 

The KS connectivity score of the profile X 1 to the gene expression signature S is 
defined as 

s'=KS up -KS down . (4.3) 

Let 

SL=m P ax(s'), SL=mW). 
1=1 1=1 

The normalized connectivity score of the profile X p to the gene expression 
signature S is defined as: 



S p = 



0, sign(KS;) = sign(KS p own ) 

s p /SL, s p >0 . (4.4) 

-s p /sL, s p <0 



Because RPPR uses only the preprocessed expression data, it is also appropriate 
for datasets generated using next generation sequencing, RNAseq and other high- 
throughput technologies. 

Method testing 

We tested RPPR using the CMAP datasets to detect expression profiles strongly 
related to the expression signatures, SI and S2, of two drugs, the Histone deacetylase 
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(HDAC) inhibitor (SI) and estrogen (S2). We compared the performance of RPPR with 
those of Lamb et al. [25] and Zhang et al. [26]. 

HDAC inhibitor signature 

CMAP arrays were preprocessed using RMA resulting in 564 expression 
profiles. The RP [29] method was then applied to every study to detect genes 
differentially expressed and to order genes in each profile. The pattern matching 
ranking method was used to score all the profiles with respect to the HDAC inhibitor 
signature [25]. We used a cutoff of 0.95 for the normalized connectivity scores to select 
profiles strongly related to the signature. Results, compared with those of Lamb et al 
[25] and Zhang et al [26], are shown in Tables 4.1 and 4.2, and Figures 4-1 and 4-2, 
respectively. 

Table 4.1 : Predicted HDAC antagonists 
(1: Detected; 0: Not detected) 



Drags 


Lamb 


Zhang 


RPPR 


valproic acid 


1 






trichostatin A 


1 






Resveratrol 










HC toxin 


1 






Vorinostat 


1 






Wortmannin 


1 








5666823 










Prochlorperazine 










1 7-allylamino-geldanamycin 
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Table 4.2 : Predicted HDAC agonists 
(1 : Detected; : Not detected) 



Drags 


Lamb 


Zhang 


RPPR 


estradiol 


1 


1 




genistein 


1 


1 




nordihydroguaiaretic acid 


1 


1 




thioridazine 










LY-294002 










staurosporine 










indometacin 


1 


1 




alpha-estradiol 


1 


1 




tretinoin 










doxycycline 


1 







troglitazone 


1 








5149715 


1 








12,13-EODE 


1 








1 7-allylamino-geldanamycin 










butein 










wortmannin 










tetraethylenepentamine 










diclofenac 










4,5 -dianilinophthalimide 










pirinixic acid 










haloperidol 
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Figure 4-1 : Venn diagram for predicted HDAC antagonists 
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Figure 4-2 : Venn diagram for HDAC inhibitors 

Figure 4-1 illustrates results of the three approaches using the HDAC inhibitor 
signature (SI). The results of RPPR are similar to those of Lamb et al. [25] because 
both used the same scoring method. Vorinostat, trichostatin A and valproic acid, well- 
known HDAC inhibitors, were all identified by RPPR. HC-toxin was not identified by 
Lamb et al. [25] or Zhang et al. [26], but was identified by RPPR. Although 17- 
allylamino-geldanamycin, wortmanin, prochlorperazine were not reported in the top 20 
profiles from RPPR, they were ranked with high scores: 0.639, 0.557 and 0.657, 
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respectively. It is interesting that only RPPR detected resveratrol which has recently 
been recognized as an HDAC inhibitor like valproic acid [119]. 

Estrogens signature 

Table 4.3 : Predicted estrogen receptor antagonists 
(1 : Detected; : Not detected) 



Drugs 


Lamb 


Zhangmine 


RPPR 


Tamoxifen 










Fulvestrant 




1 




trichostatinA 









Raloxifene 




1 




Monastrol 




1 




5224221 




1 




Oxaprozin 









HCtoxin 










Vorinostat 




1 




LY-294002 




1 




Demecolcine 









5248896 










Y-27632 





1 





Sirolimus 





1 





1 7-allylamino-geldanamycin 





1 





Monorden 





1 





Sodiumphenylbutyrate 





1 





Deferoxamine 





1 





Indomethacin 





1 





Tretinoin 





1 
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Figure 4-3 : Venn diagram for predicted 
estrogen receptor antagonists 
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Figure 4-4 : Venn diagram for predicted 
estrogen receptor agonists 

For the Estrogen signature (S2), Figures 4-3 and 4-4 show that results of RPPR 
are similar to those of Lamb. It is interesting that both RPPR and Zhang's identify 
estradiol with the highest rank, but only RPPR detected tamoxifen as negatively related 
to the estrogen signature. In fact, tamoxifen is a compound that works by blocking the 
estrogen receptors on breast tissue cells and slowing their estrogen-induced growth. 
Tamoxifen has recently been chosen as one of the best replacements for fulvestrant in 
breast cancer treatment. The results of RPPR significantly agreed with those of Lamb et 
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al. (2006) and Zhang et al. (2008). Some of RPPR's results which were new to those of 
Lamb et al. (2006) and Zhang et al. (2008) were also proved by recent literature. 

Table 4.4 : Predicted estrogen receptor agonists 
(1 : Detected; : Not detected) 



Drugs 


Known target/mechanism 


Lamb 


Zhang 


RPPR 


Valproic acid 


Epilepsy treatment, ER stimulator 




1 


1 


trichostatinA 


HDAC inhibitor, ER sensitizer 




1 


1 


HCtoxin 


HDAC inhibitor, 

ER signaling pathway related 


1 


1 


vorinostat 


Skin cancer treatment, 

Advanced- stage breast cancer related 




1 


1 


wortmannin 


Longterm estrogen deprivor (usually 
used with tamoxifen for a better ER 
inhibitor) 










5666823 


Similar to 2-deoxy-D-glucose (0.7765) 
like Tamoxifen (0.7143) 





1 





prochlorperazine 


Nausea and vomiting treatment, 
Endocrine: not used with hormone 
replacement estrogens. 





1 





17-allylamino- 
geldanamycin 


Hsp90 inhibitor, Hsp90 is required for 
the proper functioning of ER 





1 






Although RPPR used the same goodness-of-fit scoring method as Lamb et al. 
[25], it is based on different distribution functions generated by the different methods to 
detect and rank genes that are differentially expressed. In addition, instead of using the 
same distribution function for both up-regulated and down-regulated genes, RPPR used 
the up-regulation distribution of the drug-specific profile for the up-regulated genes and 
the down-regulation distribution of the profile for down-regulated genes. This helps to 
address the weakness of KS goodness-of-fit test because it is less sensitive at the tailend 
of the distribution. In restricting the range of the distribution function, the results 
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produced by RPPR are very close to those of Lamb et al. [25] and Zhang et al. [26] 
although no change was made to the step function part of the distribution function. This 
restriction helps speed up the scoring process when the distribution function can be 
approximated well by managing the step function. 

The experimental results in this study also show that, although CMAP datasets 
are generated from unreplicated single arrays, the use of meta-analysis with the rank 
product and page rank methods, and biological knowledge reduces the variance of the 
selected samples. The results therefore have strong agreement with the biological 
context of interest; e.g., the HDAC inhibitors and Estrogen receptor. 

4.1.3 Methods for feature metric 

One of the most important issues with meta-analysis is how to define a 

similarity metric across samples or studies so that results have strong biological 
relevance. Recent research has shown that, because of differences in experimental 
designs and laboratory- specific effects, integrated gene expression data from multiple 
studies may have poor correlation under the biological context of the interest. To 
overcome this problem, we propose: 

• To use protein-protein interactions (PPI) with the Random Walk algorithm to 
determine gene-gene proximities. 

• To use GO cellular component (CC) and biological process (BP) annotations 
with term-term semantic similarity to determine gene-gene similarities. 



108 



4.2 Fuzzy cluster analysis methods using Fuzzy C-Means (FCM) 

The FCM algorithm has several limitations: the initialization of the partition 
matrix can cause the algorithm to remain in local optima, the partially missing data may 
cause the algorithm to work improperly, there is no strong mathematical/statistical and 
biological basis to validate the fuzzy partition, FCM cannot determine the number of 
clusters in the dataset, and it does not have an effective method to address the problem 
of defuzzification, where the method using the maximum membership degree principle 
(MMD) has been proved inappropriate for gene expression data. 

To address these problems, we first modify the FCM objective function using the 
partition coefficient because of its ability to maintain the partition compactness (Section 
3.5.1). For initialization of the partition matrix, we developed a new Subtractive 
Clustering (SC) method using fuzzy partition of the data instead of the data themselves 
[146]. To address the problem of partially missing data, we developed two new 
imputation methods using density and probability based approaches [154, 155]. 
Regarding the cluster validation problem, we developed a new evaluation method for 
fuzzy clustering, particularly for gene expression data [147]. The evaluation model 
considers both the internal validity, which is statistical in nature, and the external 
validity, which is biological. For the problem of cluster number, we developed a new 
clustering algorithm that integrates our algorithms with the optimization genetic 
algorithm (GA) to manage a set of solution candidates [157]. Using our fuzzy cluster 
evaluation method, the optimization algorithm can search for the best solutions during 
its evolutionary processes. To produce classification information using fuzzy partition, 
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we developed a novel defuzzification method that generate a probabilistic model of the 
possibilistic model described by the fuzzy partition and apply the model to produce the 
classification information of the dataset [156]. 

4.2.1 Modification of the objective function 

Both the partition coefficient (PC) and partition entropy (PE) are best measures 
for the compactness of fuzzy partition. Because PC performs better than PE regarding 
this issue (Section 3.5.1), PC was, therefore, used as the compactness factor in the FCM 
objective function of our method: 

jr n (U,V) = XZ<d 2 (x 1 ,v k ) + ^Zu» ->min, (4.5) 

i=l k=l i=l k=l 

where |u, the learning rate, is defined as in (4.78), Section 4.3.1. 

We consider minimizing (4.5) w.r.t. (3.6) using the method of Lagrange multipliers. Let 
the Lagrange multiplier be A^ i=l,. . .,n and put 

i=l k=l i=l k=l i=l k=l 

For the necessary optimal condition, 

dL 



du ki 



^ = m(uS- 1 )[d 2 (x 1 ,v k ) + M ]-^ i =0, (4.6) 



dv k i=i 
With the same reasoning as in (3.9) and (3.10), we obtain 



fix n 

' ^ = 2X<h-v k | = 0. (4.7) 
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u ki = 



X, 



^ 



m(d (x ; ,v k ) + n) 



m-l 



X, V- 1 



m 



1 



v d (x i5 v k ) + n y 



m-l 



(4.8) 



Summing up for k=l,. . .,c and taking (3.6) into account, we have, 



Hence, 



l = Zu ki = 



f\ \ 



k=l 



^ 



v m y 



m-l , 



k=l 



^ 



v d 2 (x i ,v k ) + |a y 



m-l 



X: V-l 



m 



Z 

k=l 



d2 ( X 1? V k) + l Ll . 



m-l 



Together with (4.8), we obtain the following for the membership in the FCM version of 
our method, 



1 
d 2 (x 1? v k ) + |u 



V-i 



^ 



1 

d 2 (Xi,V!) + n 



^m-l 



(4.9) 



Similarly, we derive from (4.7) for each k, 1 < k < c: 



i=l / i=l 



(4.10) 



The update model of = {U,V} in our FCM method used (4.9) and (4.10). 

4.2.2 Partition matrix initialization method 

FCM has recently been integrated with optimization algorithms, such as, the 
Genetic Algorithm, Particle Swarm Optimization, and Ant Colony Optimization [161- 
166, 176-177]. Alternatively, a Subtractive Clustering (SC) method has been used with 
FCM, where SC is used first to determine the optimal number of clusters and the 
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location of cluster centers [105-108] and FCM is then used to determine the optimal 
fuzzy partitions [105, 111-113]. While this approach can overcome the problem of 
initialization of parameters, it still requires a priori specification of the parameters of the 
SC method: the mountain peak and the mountain radii. SC uses these two parameters to 
compute and amend the value of the mountain function for every data point while it is 
looking for the cluster candidates. Most approaches using the traditional SC method use 
constant values for these parameters [105, 109-111]. However, different datasets have 
different data distributions, and, therefore, these values need to be adjusted accordingly. 
Yang and Wu [106] proposed a method to do this automatically. However, because the 
data densities are not always distributed equally within the dataset, the automatic values 
may be appropriate only for some data points. 

In general, use of the SC method to determine the optimal number of clusters is 
based on the ratio between the amended value and the original value of the mountain 
function at every data point. The data points at which the ratios are above some 
predefined cutoff are selected and their number is used as the optimal number of 
clusters [105, 107, 110]. This approach, however, requires the specification of the cutoff 
value which will differ among datasets. 

Method design - histogram-based density 

Given = {U,V}, a fuzzy partition on X, the accumulated density at Vk, 
k=l,...,c, is calculated [112] as 
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Acc(v k ) = Xu kl . (4.11) 



i=l 



By assuming a uniform density of the data observed in each cell, an estimate 
fhist(x) of the underlying probability density function f(x) at any point x of Vk can be 
computed by: 

W*) = ^- (4.12) 

sh 

where h is the bin width and s is the number of bins generated on Vk. The density at each 

data point xi is estimated, using a histogram based method, as 



dens(x x ) = ^Acc(v k )xu kl . (4.13) 



k=l 



However, the histogram density estimator as in (4.12) has some weaknesses. It 
is a discontinuous function, and the choice of both the values of h and s has quite an 
effect on the estimated density. The apriorism needed to set those values make it a tool 
whose robustness and reliability are too low to be used for statistical estimation. The 
density estimator in (4.13) is therefore more true at the centers {vk} than at other data 
points [1 12]. If we try to find the most dense data points, c, using (4.13), we will obtain 
the centers of c clusters of the fuzzy partition. To address this problem, we use the 
concept of strong uniform fuzzy partition. 

Definition 4.1 (strong uniform fuzzy partition) [112]: Let mi < ni2 < ... < m c be c fixed 
nodes of the universe d (or universal set), d=[a,b], such that mi = a and m c = b, and 
c>2. We say that the set of c fuzzy subsets Ai, A 2 ,..., A c , identified with their 
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membership function jliai, |^a2,.-- M^ac defined on 0, form a strong uniform fuzzy 
partition of 0, if they fulfill the following conditions, for 1< k <c, 

1) |LiAk(m k )=l (the protoype). 

2) If x g[m k _i, m k +i] then |iAk(x) = 0. 

3) |LiAk(x) is continuous. 

4) |LiAk(x) monotically increases on [mti, m k ] and monotically decreases on 
[m k , m k+ i]. 

5) Vx e &, 3k such that |LiAk(x) > 0. 

c 

6) Vx e &, ^|w A k( x ) = 1 (strength condition). 

k=l 

7) m k +i-m k = h, 1 < k < c. 

8) |LiAk(m k -x) = |LiAk(m k +x), 1< k < c. 

9) Vx G[m k , m k +i], |LiAk(x) = |LiAk-i(x-h) (same shape at overlap region). 

Note that the 6 th condition is the strength condition, i.e. that without this condition 
the fuzzy partition is no longer strong. 

Proposition 4.1 (extended fuzzy partition) [112]: Let A k , k=l,..., c, be a strong uniform 
fuzzy partition on D, extended on D', then 

1) Vx, cgD, 3! k g {1,..., c-1}, such that Vk g {k , k +l}, |iAk (x) = 0, and 

MAko(x) + |iAkO+l (x)= 1. 



iix k+i 
2) Fork= l,...,c, |n Ak (x)dx = h. 



3) 3K A : [-!,!]-> [0,1], such that, n Ak (x) = K A L xU I (mk-l,mk + l) ! 
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1 

and [ K A (u)du = 1 . 



Definition 4.2\ The fuzzy histogram density estimator defined on D is given by 



1 c 

f hist ( x ) = -r Z Acc ( v k ) x u Bk ( x ) , 
nh k=1 



where {Bk}k=i,...,c is a strong uniform fuzzy partition defined on D. 



(4.14) 



It can easily be shown that fhi St > and J fhi St (x)dx = 1, and that fhist goes through 
the c points (mk, Acc(vk)/nh). 

We now define a new fuzzy partition f = {U f ,V}, based on Definition 4.1 and 
Proposition 4.1, using = {U,V}, 



( h-VkllV 1 / r ( V--A\ 



-1 



U k! = 



1=1 



(4.15) 



V J 



where 



= Z p ( 



x i|V k )Xi-VJ 



(4.16) 



i=l 



Because {U f ,V} is a strong uniform fuzzy partition [112], using Definition 4.2, the 
density at every data point can then be estimated as 



dens(x i ) = ^Acc(v k )xu ki . 



(4.17) 



k=l 



Fuzzy mountain function amendment 



Each time the most dense data point is selected, the mountain function of the 
other data points must be amended to search for new cluster centers. In the traditional 
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SC method, the amendment is done using the direct relationships between the selected 
data point and its neighborhood, i.e., using a pre-specified mountain radius. In our 
approach, this is done using the fuzzy partition and no other parameters are required. 
First, the accumulated density of all cluster centers is revised using 

Acc t+1 (v k ) = Acc t (v k )-M;xP(v k | X ;), (4.18) 

where x t is the data point selected as the new cluster center and M t is the mountain 
function value at time t. The density at each data point is then re-estimated using (4.17) 
and (4.18) as follows, 

dens t+1 (x 1 ) = XAcc t+1 (v k )xu kl 

k=l 

= X[Acc t (v k )-M;xP(v k |x;)]xu kl 

k=l 
c c 

= XAcc t (v k )xu kl - M;ju ki xP(v k | x' t ) 

k=l k=l 

c 

= dens t (x 1 )-M;^u kl xP(v k |x;). 

k=l 

Hence, the estimated density at each data point is amended as: 

dens t+1 (x 1 ) = dens t (x 1 )-M;Ju kl xP(v k \x\). (4.19) 

k=l 
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fzSC -fuzzy subtractive clustering algorithm [146] 
Steps 

1) Generate a fuzzy partition A using FCM. 

2) Create a strong uniform fuzzy partition B using (4. 1 5). 

3) Estimate the density at every xi e X using (4.17). 

4) Select the most dense data point as cluster candidate. 

5) If the stop-condition is met then Stop. 

6) Re-estimate the density at every xi using (4. 19). 

7) Go to Step 4. 

To determine the optimal number of clusters in the dataset, we first locate the 
most dense data points, LVnJ. Only the data points where the ratio between the mountain 
function values at time t and time is above 0.95, which is considered a stop-condition, 
are selected. This number is considered the optimal number of clusters in the dataset. 

Method testing 

To evaluate the ability of fzSC in searching for the best cluster centers, we ran 
fzSC on the ASET1 (Figure 3-13) with six clusters. The result is shown in Figure 4-5. 

We compared fzSC with three standard algorithms, k-means, k-medians and 
fuzzy c-means, using the ASET1 dataset (Figure 3-13). Each algorithm was run 20 
times and the correctness ratio was averaged. Results are shown in Table 4.5. 
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Figure 4-5 : Candidate cluster centers in the ASET1 dataset found using fzSC. Squares, 
cluster centers by FCM; dark circles, cluster centers found by fzSC. Classes are labeled 1 to 6. 



Table 4.5 : Algorithm performance on the ASET1 dataset 



Algorithm 




Correctness ratio by 


class 




Avg. Ratio 


1 


2 


3 


4 


5 


6 


fzSC 


1.00 


1.00 


1.00 


1.00 


1.00 


1.00 


1.00 


k-means 


0.97 


0.87 


1.00 


1.00 


1.00 


0.75 


0.93 


k-medians 


0.95 


0.82 


1.00 


1.00 


1.00 


0.62 


0.90 


FCM 


0.97 


1.00 


0.95 


1.00 


1.00 


0.96 


0.98 



With a correctness ratio of 100% for every class and the whole dataset, fzSC 
always successfully detected the optimal partition of ASET1 while the other methods 
using a random method for partition matrix did not. 
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To check the ability of fzSC to detect the number of clusters, we ran fzSC on the 
artificial datasets 20 times each and averaged the performance. 

Table 4.6 : fzSC correctness in determining cluster number on artificial datasets 



The number of clusters 


The dataset dimension 


generated in the dataset 


2 


3 


4 


5 


5 


0.97 


1.00 


1.00 


1.00 


6 


1.00 


0.98 


0.90 


1.00 


7 


1.00 


1.00 


1.00 


1.00 


8 


1.00 


0.99 


0.97 


1.00 


9 


0.87 


0.99 


1.00 


0.96 



Regarding the problem of partition matrix initialization, fzSC provides better 
results in comparison with the three standard clustering algorithms using a random 
mechanism. 

When checking the ability of fzSC on artificial datasets, the performance is 
significant. It performs worse on the datasets with nine clusters in the data space with 
the dimension of 2. We tried to plot some of these datasets. The results show that some 
clusters mostly overlap and tend to merge into one. We, therefore, can assume that the 
performance of fzSC is not good on these datasets because the data were not properly 
generated. 

For the real datasets, Iris, Wine, Glass and Breast Cancer Wisconsin, the number 
of clusters are 3, 3, 6, and 6 respectively. The data in the Wine and Glass datasets were 
normalized by attributes before clustering. The Pearson correlation distance, which is 
the most appropriate for real datasets, was used to measure the similarity between data 
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points. We ran fzSC 20 times on each of these datasets with the value of L, the number 
of solutions, set to 15. The number of clusters found was compared with the known 
number in the dataset. Results are shown in Table 4.7. 

Table 4.7 : fzSC performance on real datasets 



Dataset 


# data points 


#clusters 


predicted #clusters 


correctness ratio 


Iris 


150 


3 


3 


1.00 


Wine 


178 


3 


3 


1.00 


Glass 


214 


6 


6 

5 


0.95 

0.05 


Breast Cancer 
Wisconsin 


699 


6 


6 

5 


0.65 

0.35 



We have proposed a novel algorithm to address the problem of parameter 
initialization of the standard FCM algorithm [146]. Our method is novel because the 
subtractive clustering technique uses fuzzy partition of the data instead of the data 
themselves. The advantages of fzSC are that, unlike traditional SC methods, it does not 
require specification of the mountain peak and mountain radii, and, with a running time 
of O(cn), compared to 0(n 2 ) for the traditional SC method, it is more efficient for large 
datasets. In addition, fzSC can be integrated easily with fuzzy clustering algorithms to 
search for the best centers of cluster candidates, or to address other existing problems of 
FCM such as missing data and fuzzifier determination problems. 

We combined FCM with a new SC method to automatically determine the 
number of clusters in a dataset. FCM randomly creates a set of fuzzy partition solutions 
for the dataset using a maximum number of clusters; SC then uses the fuzzy partition 
approach to search for the best cluster centers and the optimal number of clusters for 
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each solution. FCM then rapidly determines the best fuzzy partition of every solution 
using the optimal number of clusters determined by SC. The best solution from this 
solution set is the final result. 

4.2.3 Fuzzy clustering evaluation method 

Recently, many cluster validity indices have been developed to evaluate results 

of FCM. Bezdek [92] measured performance using partition entropy and the overlap of 
adjacent clusters. Fukuyama and Sugeno [93] combined the FCM objective function 
with the separation factor, while Xie and Beni [94] integrated the Bezdek index [92] 
with the cluster separation factor. Rezaee et al. [95] combined the compactness and 
separation factors, and Pakhira et al. [96] combined the same two factors where the 
separation factor was normalized. Recently, Rezaee [97] proposed a new cluster index 
in which the two factors were normalized across the range of possible numbers of 
clusters. These were introduced in Section 3.5.1. 

Here, we propose a fuzzy cluster validation method that uses the fuzzy partition 
and the distance matrix between cluster centers and data points. Instead of compactness 
and separation, fzBLE uses a Bayesian model and a log-likelihood estimator. With the 
use of both the possibility model and the probability model to represent the data 
distribution, fzBLE is appropriate for artificial datasets where the distribution follows a 
standard model, as well as for real datasets that lack a standard distribution. We show 
that fzBLE outperforms popular cluster indices on both artificial and biological datasets. 
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Method design - Bayesian & likelihood based approach 

Instead of using the two factors, compactness and separation, we propose a 
validation method (fzBLE) that is based on a log likelihood estimator with a fuzzy 
based Bayesian model. 

Given a fuzzy clustering solution modeled by = {U, V}, where V represents 
the cluster centers and U is the partition matrix representing the membership degrees of 
the data points to the clusters. The likelihood of the clustering model and the data is 
measured as 



L(0|X) = L(U,V|X) 

=np(x l iu,v)=nxp(v k )xp(x l iv k ), 

i=l i=l k=l 

where the log likelihood is computed using the following, 

log(L) = ]Tlog|XP(v k )xP( Xi |v k ) 

i=l Vk=l J 



(4.20) 



^max. (4.21) 



Because our clustering model is possibility-based, to apply Equations (4.20) and 
(4.21), a transformation from possibility to probability is needed. Given a fuzzy 
clustering model = {U, V}, according to [98], Uki is the possibility that Vk = xi. If is a 
proper fuzzy partition, then there exists some x such that Uk(x ) = 1, k=l,. . .,c, and Uk 
is a normal possibility distribution. Assume Pk is the probability distribution of Vk on X, 
where pki > pk2 > pk3 >• • . > pkn- We associate with Pk a possibility distribution Uk on X 
[98] such that u^ is the possibility of Xi, where 
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u k „=nxp kn , 

u kl =i(Pki-Pk,i + i)+u k;i+1 , i = (n -!),...,!. 



(4.22) 



Reversing (4.22), we obtain the transformation of a possibility distribution to a 
probability distribution. Assume that Uk is ordered the same way with Pk on X: 



U k i > Uk2 > U k3 >. . > U kn . 

P kn =u kn /n, 
Pki=Pk, 1+ i+(u ki -u kji+1 )/i. 



(4.23) 



Pk is an approximate probability distribution of Vk on X, and pki = P(xi|vk). If Uk 
is a normal possibility distribution then Zpki = 1 . 

The data distributions 

Having computed the value of Pk, we can estimate the variance <Tk, the prior 
probability P(vk), and the normal distribution of Vk for cluster k [27]. 



V II II 2 



(4.24) 



i=l 



iP(XiK) 



P(v k )=" 



ZlPCxJv,) 



(4.25) 



1=1 i=l 



Pp(xjv k ) = 



|x,-v k | : 



2 A 



(2<> 1/p XG k xe 



(4.26) 



In real datasets, for a cluster Vk, the data points usually come from different 
random distributions. Because they cluster in Vk, they tend to follow the normal 
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distribution of Vk estimated as in (4.26). This idea is based on the Central Limit 
Theorem. 

Theorem 4.1 (Central Limit Theorem): S n is the sum of n mutually independent random 
variables, then the distribution function of S n is well-approximated by a certain type of 
continuous function known as a normal density function, which is given by the formula 

i ( x ~^) 2 



f^W = — s=e ^ , (4.27) 

GV27T 



where |u and a are the mean and variance of S n respectively. We integrate the 
probabilities computed in (4.23) and (4.27) for the probability of the data point xi given 
cluster Vk as 

P*(x, | v k ) = max{P(x 1 | v k ),P n (x 1 | v k )}. (4.28) 

Equation (4.28) better represents the data distribution, particularly in real datasets. Our 
method, fzBLE [147], is based on (4.21), (4.26) and (4.28). 

Method testing 

We tested fzBLE with other well-known cluster indices: partition coefficient - 
PC (3.23) and partition entropy - PE (3.24) [92], Fukuyama-Sugeno - FS (3.25) [93], 
Xie and Beni - XB (3.26) [94], Compose Within and Between scattering - CWB (3.28) 
[95], Pakhira-Bandyopadhyay-Maulik cluster index - PBMF (3.29) [96], and Babak 
Rezaee cluster index - BR (3.30) [97] using the artificial and real datasets. 

For each artificial dataset, we ran the standard FCM algorithm five times with m 
set to 2.0 and the partition matrix initialized randomly. In each case, the best fuzzy 



124 



partition was then selected to run fzBLE and the other cluster indices to search for the 
optimal number of clusters using numbers from 2 to 12 and to compare this with the 
known number of clusters. We repeated the experiment 20 times and averaged the 
performance of each method. Table 4.8 shows the fraction of correct predictions. fzBLE 
and PBMF outperform other approaches. The CF method, which is based on the 
compactness factor, was the least effective. 

Table 4.8 : Fraction of correct cluster predictions on artificial datasets 



#c 


fzble 


PC 


PE 


FS 


XB 


CWB 


PBMF 


BR 


CF 


3 


1.00 


0.42 


0.42 


0.42 


0.42 


1.00 


1.00 


0.83 


0.00 


4 


1.00 


0.92 


0.92 


0.92 


0.83 


1.00 


1.00 


1.00 


0.00 


5 


1.00 


0.75 


0.75 


0.83 


0.75 


0.83 


1.00 


1.00 


0.00 


6 


1.00 


0.92 


0.83 


0.92 


0.58 


0.58 


1.00 


0.92 


0.00 


7 


1.00 


0.83 


0.83 


0.83 


0.67 


0.58 


1.00 


0.67 


0.00 


8 


1.00 


1.00 


0.92 


1.00 


0.92 


0.67 


1.00 


0.83 


0.00 


9 


1.00 


0.92 


0.67 


0.92 


0.67 


0.33 


1.00 


0.83 


0.00 



Table 4.9 : Validation method performance on the Iris dataset (3 true clusters) 



#c 


fzble 


PC 


PE 


FS 


XB 


CWB 


PBMF 


BR 


CF 


2 


-763.0965 


0.9554 


0.0977 


-10.6467 


0.0203 


177.1838 


12.3280 


1.1910 


0.9420 


3 


-762.8034 


0.8522 


0.2732 


-9.3369 


0.1292 


213.4392 


17.7131 


1.0382 


0.3632 


4 


-764.8687 


0.7616 


0.4381 


-7.4821 


0.2508 


613.2656 


14.4981 


1.1344 


0.2665 


5 


-770.2670 


0.6930 


0.5703 


-8.2331 


0.3473 


783.4697 


13.6101 


1.0465 


0.1977 


6 


-773.6223 


0.6549 


0.6702 


-7.3202 


0.2805 


904.3365 


12.3695 


1.0612 


0.1542 


7 


.774.4740 


0.6155 


0.7530 


-6.8508 


0.2245 


1029.7342 


11.2850 


0.9246 


0.1262 


8 


-774.8463 


0.6000 


0.8111 


-6.9273 


0.3546 


1635.3593 


10.5320 


0.8692 


0.1072 


9 


-780.1901 


0.5865 


0.8556 


-6.6474 


0.3147 


1831.5705 


9.9357 


0.7653 


0.0905 


10 


-781.7951 


0.5765 


0.8991 


-6.0251 


0.2829 


2080.3339 


9.3580 


0.7076 


0.0787 
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Tables 4.9 - 4.11 show the test results on the Iris, Wine and Glass datasets. For 
the Iris dataset, only fzBLE and PBMF detected the correct number of clusters (Table 
4.9). For the Wine and Glass datasets, only fzBLE and CWB, and only fzBLE, 
respectively, detected the correct number of clusters (Tables 4.10 and 4.1 1). 

Table 4.10 : Validation method performance on the Wine dataset (3 true clusters) 



#c 


fzble 


PC 


PE 


FS 


XB 


CWB 


PBMF 


BR 


CF 


2 


-926.4540 


0.9264 


0.1235 


-113.0951 


0.1786 


3.9100 


1.3996 


2.0000 


61.1350 


3 


-924.0916 


0.8977 


0.1764 


-104.9060 


0.2154 


3.2981 


0.9316 


1.4199 


39.3986 


4 


-932.8377 


0.8607 


0.2525 


-139.9144 


0.5295 


6.6108 


0.6306 


1.1983 


33.7059 


5 


-929.6146 


0.8225 


0.3281 


-126.5746 


0.5028 


6.9001 


0.4700 


1.0401 


28.4741 


6 


-928.8121 


0.8066 


0.3669 


-118.4715 


0.6173 


9.2558 


0.3706 


0.9111 


25.3451 


7 


-930.6451 


0.7988 


0.3874 


-120.3128 


0.6465 


10.3803 


0.2972 


0.7629 


23.1742 


8 


-932.0462 


0.7993 


0.3917 


-124.7999 


0.6459 


11.0836 


0.2471 


0.6392 


21.4411 


9 


-932.1902 


0.7929 


0.4120 


-122.8396 


0.6367 


11.8373 


0.2100 


0.5801 


19.9154 


10 


-935.0478 


0.7909 


0.4217 


-130.9089 


0.6270 


11.9941 


0.1773 


0.5252 


18.9891 



Table 4.1 1 : Validation method performance on the Glass dataset (6 true clusters) 



#c 


fzble 


PC 


PE 


FS 


XB 


CWB 


PBMF 


BR 


CF 


2 


-1135.6886 


0.8884 


0.1776 


0.3700 


0.7222 


6538.9311 


0.3732 


1.9817 


0.5782 


3 


-1127.6854 


0.8386 


0.2747 


0.1081 


0.7817 


4410.3006 


0.4821 


1.5004 


0.4150 


4 


-1119.2457 


0.8625 


0.2515 


-0.0630 


0.6917 


3266.5876 


0.4463 


1.0455 


0.3354 


5 


-1123.2826 


0.8577 


0.2698 


-0.1978 


0.6450 


2878.8912 


0.4610 


0.8380 


0.2818 


6 


-1113.8339 


0.8004 


0.3865 


-0.2050 


1.4944 


5001.1752 


0.3400 


0.8371 


0.2430 


7 


-1116.5724 


0.8183 


0.3650 


-0.2834 


1.3802 


5109.6082 


0.3891 


0.6914 


0.2214 


8 


-1127.2626 


0.8190 


0.3637 


-0.3948 


1.4904 


7172.2250 


0.6065 


0.5916 


0.2108 


9 


-1117.7484 


0.8119 


0.3925 


-0.3583 


1.7503 


8148.7667 


0.3225 


0.5634 


0.1887 


10 


-1122.1585 


0.8161 


0.3852 


-0.4214 


1.7821 


9439.3785 


0.3909 


0.4926 


0.1758 


11 


-1121.9848 


0.8259 


0.3689 


-0.4305 


1.6260 


9826.4211 


0.3265 


0.4470 


0.1704 


12 


-1135.0453 


0.8325 


0.3555 


-0.5183 


1.4213 


11318.4879 


0.5317 


0.3949 


0.1591 


13 


-1138.9462 


0.8317 


0.3556 


-0.5816 


1.4918 


14316.7592 


0.6243 


0.3544 


0.1472 
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For the Yeast dataset, we ran the FCM algorithm with m set to 1.17 [60] and 
used the clustering partition to test all methods as in previous sections. Table 4.12 
shows that only fzBLE detected the correct number of clusters (five) in Yeast dataset. 

Table 4.12 : Validation method performance on the Yeast dataset (5 true clusters) 



#c 


fzble 


PC 


PE 


FS 


XB 


CWB 


PBMF 


BR 


CF 


2 


-2289.8269 


0.9275 


0.1172 


-85.1435 


0.2060 


8.3660 


1.2138 


2.0000 


133.0734 


3 


-2296.4502 


0.9419 


0.0983 


-157.2825 


0.2099 


4.7637 


0.6894 


1.0470 


94.6589 


4 


-2305.3369 


0.9437 


0.1000 


-191.7664 


0.2175 


4.0639 


0.5575 


0.7240 


74.7629 


5 


-2289.3070 


0.9087 


0.1648 


-187.1073 


1.0473 


13.6838 


0.4087 


0.6722 


65.9119 


6 


-2296.3098 


0.8945 


0.1939 


-196.6711 


0.9932 


13.8624 


0.3050 


0.6170 


60.8480 


7 


-2296.6017 


0.8759 


0.2299 


-198.2858 


1.0558 


15.4911 


0.2434 


0.5686 


56.1525 


8 


-2299.4225 


0.8634 


0.2526 


-201.7688 


1.0994 


16.9644 


0.2050 


0.5132 


51.2865 


9 


-2299.3653 


0.8453 


0.2871 


-205.1489 


1.2340 


20.2532 


0.1741 


0.4819 


48.0737 


10 


-2302.7581 


0.8413 


0.2992 


-208.5687 


1.1947 


20.7818 


0.1512 


0.4533 


45.9442 


11 


-2300.3294 


0.8325 


0.3186 


-209.4023 


1.1731 


21.1525 


0.1307 


0.4272 


43.6600 


12 


-2307.5701 


0.8290 


0.3272 


-213.4658 


1.2245 


23.0389 


0.1157 


0.4040 


42.1594 


13 


-2310.7819 


0.8270 


0.3354 


-215.2463 


1.3036 


25.4062 


0.1016 


0.3847 


40.8654 



Table 4.13 : Validation method performance on the Yeast-MIPS dataset (4 true clusters) 



#c 


fzble 


PC 


PE 


FS 


XB 


CWB 


PBMF 


BR 


CF 


2 


-1316.4936 


0.9000 


0.1625 


25.4302 


0.3527 


16.7630 


0.7155 


1.9978 


81.0848 


3 


-1317.3751 


0.9092 


0.1615 


-32.8476 


0.2981 


10.1546 


0.8032 


1.2476 


58.2557 


4 


-1304.0374 


0.8216 


0.3252 


-39.4858 


2.5297 


39.8434 


0.5400 


1.3218 


48.6275 


5 


-1308.6776 


0.8279 


0.3216 


-54.4979 


2.4245 


34.9963 


0.3620 


0.9558 


41.9671 


6 


-1309.9191 


0.8211 


0.3460 


-59.8918 


2.3511 


35.4533 


0.2691 


0.8291 


38.5468 


7 


-1315.3692 


0.8139 


0.3654 


-65.4866 


2.3562 


38.8797 


0.2423 


0.7252 


36.0906 


8 


-1315.1479 


0.8062 


0.3918 


-67.6774 


2.4958 


43.9502 


0.1966 


0.6712 


34.1387 


9 


-1321.2280 


0.8109 


0.3874 


-72.3197 


2.2854 


41.2112 


0.1664 


0.6072 


32.3289 


10 


-1324.1578 


0.8158 


0.3847 


-74.7867 


2.0433 


37.6154 


0.1395 


0.5588 


30.9686 
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For the Yeast-MIPS dataset, we ran the FCM algorithm using the same 
parameters as with the Yeast dataset. The results in Table 4.13 show that only fzBLE 
correctly detected the four clusters in the Yeast-MIPS dataset. 

For the RCNS (Rat Central Nervous System) dataset, we ran fzBLE and the 
other cluster indices on the dataset clustering partition found by the standard FCM 
algorithm using the Euclidean metric for distance measurement. Table 4.14 shows that, 
again, only fzBLE detected the correct number of clusters. 

Table 4.14 : Validation method performance on the RCNS dataset (6 true clusters) 



#c 


fzble 


PC 


PE 


FS 


XB 


CWB 


PBMF 


BR 


CF 


2 


-580.0728 


0.9942 


0.0121 


-568.7972 


0.0594 


5.5107 


4.2087 


1.1107 


177.8094 


3 


-564.1986 


0.9430 


0.0942 


-487.6104 


0.4877 


4.1309 


4.2839 


1.6634 


117.9632 


4 


-561.0169 


0.9142 


0.1470 


-430.4863 


0.9245 


6.1224 


3.3723 


1.3184 


99.1409 


5 


-561.7420 


0.8900 


0.1941 


-397.0935 


1.3006 


9.4770 


2.6071 


1.1669 


88.5963 


6 


-552.9153 


0.8695 


0.2387 


-300.6564 


2.5231 


20.6496 


1.9499 


1.1026 


84.0905 


7 


-556.2905 


0.8707 


0.2386 


-468.3121 


2.1422 


21.0187 


2.8692 


0.7875 


57.5159 


8 


-555.3507 


0.8925 


0.2078 


-462.0673 


1.7245 


20.0113 


2.5323 


0.5894 


52.0348 


9 


-558.8686 


0.8863 


0.2192 


-512.4278 


1.6208 


22.4772 


2.6041 


0.5019 


45.9214 


10 


-565.8360 


0.8847 


0.2241 


-644.1451 


1.1897 


21.9932 


3.4949 


0.3918 


33.1378 



We presented a novel method that uses the log likelihood estimator with a 
Bayesian model and the possibility, rather than the probability, distribution model of the 
dataset from the fuzzy partition. By using the Central Limit Theorem, fzBLE effectively 
represents distributions in real datasets. Results have shown that fzBLE performs 
effectively on both artificial and real datasets. 
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4.2.4 Fuzzy clustering evaluation using Gene Ontology [180] 

The shortcoming of fzBLE is that it is solely based on the data and, therefore, 
unable to apply prior biological knowledge into cluster evaluation. We developed 
GOfzBLE, a cluster validation method that generates the fuzzy partition for a given 
clustering solution using GO term based semantic similarity and applies fzBLE method 
to evaluate the clustering solution using the fuzzy partition. 

Gene Ontology 

The Gene Ontology (GO) [181] is a hierarchy of biological terms using a 
controlled vocabulary that includes three independent ontologies for biological process 
(BP), molecular function (MF) and cellular component (CC). Standardized terms known 
as GO terms describe roles of genes and gene products in any organism. GO terms are 
related to each other in the form of parent-child relationships. A gene product can have 
one or more molecular functions, can participate in one or more biological processes, 
and can be associated with one or more cellular components [182]. As a way to share 
knowledge about functionalities of genes, GO itself does not contain gene products of 
any organism. Rather, expert curators specialized in different organisms annotate 
biological roles of gene products using GO annotations. Each GO annotation is assigned 
with an evidence code that indicates the type of evidence supporting the annotation. 

Semantic similarity 

GO is structured as directed acyclic graphs (DAGs) in which the terms form 
nodes, and the two kinds of semantic relationships, "is-a" and "part-of , form edges 
[184]. "is-a" is a simple class-subclass relation, where A is-a B means that A is a 
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subclass of B. 'part-of ' is a partial ownership relation; C part-of D means that 
whenever C is present, it is always a part of D, but C need not always be present. The 
structure of DAG allows assigning a metric to a set of terms based on the likeliness of 
their meaning content which is used to measure semantic similarity between terms. 
Multiple GO based semantic similarity measures have been developed [183, 184], and 
are increasingly used to evaluate the relationships between proteins in protein-protein 
interactions, or co-regulated genes in gene expression data analysis. Among of the 
existing semantic similarity measurement methods, that of Resnik is most appropriate 
for gene expression data analysis because it is strong correlated with gene sequence 
similarities and gene expression profiles [183]. However, Wang et al. [184] had shown 
that the Resnik' s method has a drawback in that it uses only the information content 
derived from annotation statistics which is not suitable for measuring semantic 
similarity of GO terms. We therefore propose to use Wang's method for GO semantic 
similarity measurement. For each term A, a semantic value, S(A), is computed as: 

S(A)=£S A (t), (4.29) 

teT A 

where T A is a set of terms including the term A and its ancestors, and Sa(.) is the 
sematic value regarding the term A, defined as in (4.30), 

S A (t)H > ( 4 - 3 °) 

AV [max{w?xS A (u),ueChildrenOf(t)},t*A 

where w" is the semantic contribution factor [184] for edge connecting term t with its 
child, term u. The semantic similarity between two terms, A and B, is defined as: 
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X(S A (t) + S B (t)) 

S GO (A,B) = ^^ . (4.31) 

GO SV(A) + SV(B) 

Using the semantic similarities between GO terms, the semantic similarity 
between the two GO term sets, Gl and G2, is defined as: 

£Sim(gl,G2)+ £sim(g2,Gl) 

Sim(Gl,G2) = ^i g2eG2 , (4.32) 

Gl + G2 

where Sim(t,T) is the similarity between the term t and the term set T and is defined as 
in (4.33), 

Sim(t, T) = max{S G0 (t, u)} . (4.33) 

ueT 

Method design 

For a given crisp clustering solution of gene expression data, an approximate 
fuzzy partition is generated based on the gene-gene semantic similarities using GO term 
annotations. The fuzzy partition is applied to the fzBLE method for evaluation of the 
clustering solution. Each crisp clustering solution is modeled with C = {M, V}, where 
V represents the cluster centers and M, M={mki}; niki e {0,1}; k=l,...,c; i=l,...,n, is the 
crisp c-partition matrix representing the membership of the data points to the clusters. In 
case where a fuzzy clustering solution 0, = {U, V}, is given, C can be computed 
based on using our defuzzification method for fuzzy partition [154]. 

Fuzzification of a crisp clustering partition- In order to use GO terms to validate 
crisp clustering solution, a fuzzy partition is generated based on the crisp partition using 
GO semantic similarity. This is done by assigning the crisp partition into the space of 
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GO terms. Each data object xi, i=l,..., n, is corresponding with a vector of GO 
annotations x G0 , and a vector of degrees of belief (DOB) x CF , where x CF is the DOB 

of the term x G0 annotated to xi (Table 4.15). Because GO annotations of a gene/gene 

product may come from different sources (Table 4.16), use of DOBs can help with 
combining multiple annotations using the same term of the same data object. 

Table 4.15 : Degrees of belief of GO annotation evidences 



Evidence 


Degree of belief 


EXP 


1.0 


IDA, IPI, TAS 


0.9 


IMP, IGI, IEP 


0.7 


ISS, ISO, ISA, ISM 


0.4 


IGC 


0.2 


IBA, IBD, IKR, IRD, RCA 


0.3 


NAS, IC, ND, NR 


0.0 


IEA 


0.1 



The GO annotations for each cluster Vk, v k ° for k=l,..., c, can be determined 
using the GO annotations of the members of cluster Vk: 

v k G0 = U x G0 (4.34) 



GO 



r CF 



For each annotation v kt of Vk, t=l,..., v k , its DOB, v kt , is computed using the 
DOBs of cluster's members annotated with term t: 



v- =mean(xf ,x e v k ,x G ° = v°°J (4.35) 
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Table 4.16 : Gene Ontology evidence codes 



Experimental Evidence Codes 


EXP 


Inferred from Experiment 


IDA 


Inferred from Direct Assay 


IPI 


Inferred from Physical Interaction 


IMP 


Inferred from Mutant Phenotype 


IGI 


Inferred from Genetic Interaction 


IEP 


Inferred from Expression Pattern 


Computational Analysis Evidence Codes 


ISS 


Inferred from Sequence or Structural Similarity 


ISO 


Inferred from Sequence Orthology 


ISA 


Inferred from Sequence Alignment 


ISM 


Inferred from Sequence Model 


IGC 


Inferred from Genomic Context 


IBA 


Inferred from Biological aspect of Ancestor 


IBD 


Inferred from Biological aspect of Descendant 


IKR 


Inferred from Key Residues 


IRD 


Inferred from Rapid Divergence 


RCA 


inferred from Reviewed Computational Analysis 


Author Statement Evidence Codes 


TAS 


Traceable Author Statement 


NAS 


Non-traceable Author Statement 


Curator Statement Evidence Codes 


IC 


Inferred by Curator 


ND 


No biological Data available 


Automatically-assigned Evidence Codes 


IEA 


Inferred from Electronic Annotation 


Obsolete Evidence Codes 


NR 


Not Recorded 
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We apply the semantic similarity measure as in (4.32) to compute the semantic 
similarity between cluster Vk and data point xi using their GO annotations. We propose a 
modification of (4.31) so that it can be used with DOB. For the term v£° of cluster Vk 

and the term x?° of data point xi, their semantic similarity is defined as: 

Qf / TT GO ^GO\ c S^rGO v GO\ WWk - s CF v CF \ s A ^r\ 

SooKt > x ij ) = S Go( v kt > x ij ) xmm (v kt9 x 1J ) (4-36) 

We use (4.36) instead of (4.31) when computing the semantic similarity between Vk and 
Xi, Sim^v^Xi), using (4.32). The distance between Vk and Xi based on GO semantic 
similarity is then defined as: 

d^(v k ,x 1 ) = d 2 (l,Sim'(v k ^ 1 )). (4.37) 

A fuzzy partition model using GO terms space 0go, 0go = {Ugo, _}, of C is now 
determined where Uqo is computed as in (4.9) using the distance function defined in 
(4.37). 0qo is used with fzBLE for the clustering solution evaluation. Because 0go is 
based on GO annotation, we proposed a method to compute the prior probability of Vk, 
Pp(vk), k=l,..., c, using the prior probabilities of the GO terms used by Vk, T. For a 
given term t, teT, 

sup (v£) 

P(t|v k ) = U£V -7 c . , (4.39) 

sum(v k ) 
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where P p (t) and P(t|vk) are the prior probability of the term t and the conditional 
probability of the term t given Vk respectively. The prior probability of Vk, k=l,..., c, is 
computed as in (4.40), 



P p (v k )=EP(t,v k )xP p (t) 

= XP(t|v k )xP(v k )xP p (t) 



(4.40) 



Method testing 



We tested GOfzBLE with fzBLE and the other cluster indices as in the Section 



4.2.3 using the Yeast, Yeast-MIPS and RCNS datasets. 



Table 4. 17 : Validation method performance on the Yeast dataset using GO-BP 



#c 


GOfzBLE 


PC 


PE 


FS 


XB 


CWB 


PBMF 


BR 


fzBLE 


2 


-4832.656 


0.928 


0.117 


-85.144 


0.206 


8.366 


1.214 


2.000 


-2289.827 


3 


-4832.542 


0.942 


0.098 


-157.283 


0.210 


4.764 


0.689 


1.047 


-2296.450 


4 


-4832.751 


0.944 


0.100 


-191.766 


0.218 


4.064 


0.558 


0.724 


-2305.337 


5 


-4831.548 


0.909 


0.165 


-187.107 


1.047 


13.684 


0.409 


0.672 


-2289.307 


6 


-4831.890 


0.895 


0.194 


-196.671 


0.993 


13.862 


0.305 


0.617 


-2296.310 


7 


-4833.182 


0.876 


0.230 


-198.286 


1.056 


15.491 


0.243 


0.569 


-2296.602 


8 


.4833.644 


0.863 


0.253 


-201.769 


1.099 


16.964 


0.205 


0.513 


-2299.423 


9 


-4832.358 


0.845 


0.287 


-205.149 


1.234 


20.253 


0.174 


0.482 


-2299.365 


10 


-4832.246 


0.841 


0.299 


-208.569 


1.195 


20.782 


0.151 


0.453 


-2302.758 


11 


-4832.600 


0.833 


0.319 


-209.402 


1.173 


21.153 


0.131 


0.427 


-2300.329 


12 


-4832.668 


0.829 


0.327 


-213.466 


1.225 


23.039 


0.116 


0.404 


-2307.570 


13 


-4832.217 


0.827 


0.335 


-215.246 


1.304 


25.406 


0.102 


0.385 


-2310.782 



For the Yeast dataset, only GOfzBLE and fzBLE correctly identified the number 
of clusters (Table 4.17). Table 4.18 again shows that these two algorithms detected 
correctly the number of clusters. While fzBLE works based on gene expression levels, 
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GOfzBLE works based on GO-BP annotations. The results also show that GO-BP 
annotations are strongly correlated with gene-gene differential co-expression patterns. 
In other words, we may utilize either gene-gene co-expression patterns in gene 
expression data or GO-BP annotations to search for genes having similar functionality. 

Table 4.18 : Validation method performance on the Yeast-MIPS dataset using GO-BP 



#c 


GOfzBLE 


PC 


PE 


FS 


XB 


CWB 


PBMF 


BR 


fzBLE 


2 


-2288.413 


0.900 


0.163 


25.430 


0.353 


16.763 


0.716 


1.998 


-1316.494 


3 


-2286.843 


0.909 


0.162 


-32.848 


0.298 


10.155 


0.803 


1.248 


-1317.375 


4 


-2283.854 


0.822 


0.325 


-39.486 


2.530 


39.843 


0.540 


1.322 


-1304.037 


5 


-2285.069 


0.828 


0.322 


-54.498 


2.425 


34.996 


0.362 


0.956 


-1308.678 


6 


-2286.252 


0.821 


0.346 


-59.892 


2.351 


35.453 


0.269 


0.829 


-1309.919 


7 


-2286.834 


0.814 


0.365 


-65.487 


2.356 


38.880 


0.242 


0.725 


-1315.369 


8 


-2287.543 


0.806 


0.392 


-67.677 


2.496 


43.950 


0.197 


0.671 


-1315.148 


9 


-2288.333 


0.811 


0.387 


-72.320 


2.285 


41.211 


0.166 


0.607 


-1321.228 


10 


-2288.954 


0.816 


0.385 


-74.787 


2.043 


37.615 


0.140 


0.559 


-1324.158 



Table 4.19 : Validation method performance on the RCNS dataset using GO-CC 



#c 


GOfzBLE 


PC 


PE 


FS 


XB 


CWB 


PBMF 


BR 


fzBLE 


2 


-970.340 


0.994 


0.012 


-568.797 


0.059 


5.511 


4.209 


1.111 


-580.073 


3 


-970.460 


0.943 


0.094 


-487.610 


0.488 


4.131 


4.284 


1.663 


-564.199 


4 


-969.909 


0.914 


0.147 


-430.486 


0.925 


6.122 


3.372 


1.318 


-561.017 


5 


-969.669 


0.890 


0.194 


-397.094 


1.301 


9.477 


2.607 


1.167 


-561.742 


6 


-969.659 


0.870 


0.239 


-300.656 


2.523 


20.650 


1.950 


1.103 


-552.915 


7 


-970.148 


0.871 


0.239 


-468.312 


2.142 


21.019 


2.869 


0.788 


-556.291 


8 


-969.787 


0.893 


0.208 


-462.067 


1.725 


20.011 


2.532 


0.589 


-555.351 


9 


-970.300 


0.886 


0.219 


-512.428 


1.621 


22.477 


2.604 


0.502 


-558.869 


10 


-970.487 


0.885 


0.224 


-644.145 


1.190 


21.993 


3.495 


0.392 


-565.836 
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The RCNS dataset has six groups of similarly expressed genes, two of which are 
invariant [87]. We first ran GOfzBLE using GO-CC annotations and compared with the 
other methods. The results are shown in Table 4.19. Only GOfzBLE and fzBLE 
identified six clusters in the dataset, corresponding to the six groups in the analysis 
results performed by Wen et al. [87]. We then reran GOfzBLE using GO-BP 
annotations and compared with the other methods. The results are shown in Table 4.20. 
GOfzBLE identified four clusters in the dataset that are corresponding to the four stages 
in the developmental process. The result of GOfzBLE again shows that GO-BP 
annotations are strongly correlated with gene-gene differential co-expression patterns. It 
also shows that GO-BP annotations are useful in creating effective external criteria for 
cluster analysis of gene expression data. 

Table 4.20 : Validation method performance on the RCNS dataset using GO-BP 



#c 


GOfzBLE 


PC 


PE 


FS 


XB 


CWB 


PBMF 


BR 


fzBLE 


2 


-1373.999 


0.994 


0.012 


-568.797 


0.059 


5.511 


4.209 


1.111 


-580.073 


3 


-1373.935 


0.943 


0.094 


-487.610 


0.488 


4.131 


4.284 


1.663 


-564.199 


4 


-1373.776 


0.914 


0.147 


-430.486 


0.925 


6.122 


3.372 


1.318 


-561.017 


5 


-1374.208 


0.890 


0.194 


-397.094 


1.301 


9.477 


2.607 


1.167 


-561.742 


6 


-1374.496 


0.870 


0.239 


-300.656 


2.523 


20.650 


1.950 


1.103 


-552.915 


7 


-1374.526 


0.871 


0.239 


-468.312 


2.142 


21.019 


2.869 


0.788 


-556.291 


8 


-1374.811 


0.893 


0.208 


-462.067 


1.725 


20.011 


2.532 


0.589 


-555.351 


9 


-1375.199 


0.886 


0.219 


-512.428 


1.621 


22.477 


2.604 


0.502 


-558.869 


10 


-1375.426 


0.885 


0.224 


-644.145 


1.190 


21.993 


3.495 


0.392 


-565.836 



4.2.5 Imputation methods for partially missing data 

As discussed earlier in Section 3.4, given a data point Xi with missing attributes, 
the global model based approach updates the missing values of Xi using the information 
of the whole data. In contrast, the local model based approach does that using the 
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information of the data points in the neighborhood of xi. The former has an advantage of 
eliminating the noise that can be caused by the data points around xi, particularly in the 
case when xi is in the overlapping region. The latter, however, has the advantage that it 
may provide the data distribution model that xi is following. 

Recent research has shown that the global model based approach, the optimal 
completion strategy (OCS) is an example, outperforms the local model based one [139, 
140, 143]. Existing approaches usually update X M using either a global model, or a local 
model. Luo et al. [141], by using a local update model integrated into the FCM 
algorithm, showed that his method outperformed two well-known methods: K-nearest 
neighborhood (KNN) and Local Least Squares K-nearest neighborhood (SKNN). 
However, the method of Luo et al. is only based on the local model information. Hence, 
it may fail at the overlapping region. Mohammadi et al. [142] improve the work of Luo 
et al. by using an external distance based on GO terms. The relationships of the terms in 
GO in fact are locally detected, methods using them are just a kind of local model. 

Regarding the problem of missing data, we developed separately two new 
algorithms using our fzBLE and fzSC methods, namely probability based (fzPBI) and 
density based (fzDBI) imputation algorithms, to create "abstract" fuzzy partition based 
on the partition derived from the partially distance strategy method (PDS) [140]. These 
abstract partitions not only contain the global distribution model of the data, but they 
also represent the local distributions at every data points in the dataset. We then 
developed new update models for X M using both the global and local data distribution 
models derived from the abstract partitions [154, 155]. 
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4.2.6 Probability based imputation method (fzPBI) [154] 

Method design 

The objective of fzPBI is to cluster a dataset X, that may be incomplete, into c 
clusters, X m are estimated during the clustering process with respect to the optimization 
of the function J m . Let X w = {xigX | Xi is a complete data object}, X P = {xy | Xy ^ ?}, 
and X M = {xy | xy= ?}. The distance measure, d (.) therefore is defined as: 

where w,- indicates the contribution degree of the j th attribute of xi, Xy, in the distance 
between xi and Vk. If Xy e X m , w,- increases with each iteration to avoid premature use of 
estimated values. Therefore, we define Wj as: 

f 1 x, e X p 

W Ht/T Y ' (4 ' 42) 

[t/T x { . g X M 

where t is the iteration index, and T is the number of iterations; < t < T. 

To impute the values of Xm, we applied our fzBLE method [148] to generate a 
probabilistic model of the data distribution using the fuzzy partition. This model is then 
used to impute the missing values. For each cluster Vk, k=l,...,c, a probability 
distribution {pki}i=i,...,n is derived from the possibility distribution {uki}i=i,..., n as in 
(4.22) and (4.23). The data distributions model is then estimated using (4.24), (4.25) 
and (4.26). The model for missing data estimation is the developed as follows: 
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Let P k m ={Pki>Pk2>--->Pw,} ^ e a set of probabilities at v k , where p£L j=l,...,p, 



indicates the probability that attribute j is missing in cluster k, defined as: 



t^^-h) 



n m = ■ i=1 

Pki 



IP(x„v k ) 



(4.43) 



i=l 



where 



Jlx y eX p 
IJ 10x s eX M 



Hence, the probability that a data object xi has a missing attribute j in cluster k 



is: 



P(x M(ij)|k ) = P(v k )p™ 



Ffp£ 

t=l,t*j 



P(x,|v k ). 



(4.44) 



Because each cluster Vk can be considered a component of the data distribution 
model of X, the estimated value of x y - e X M is computed as: 



X ij _ ^^V X M(ij)|k fk / Zj^V X M(ij)|k )• 



(4.45) 



k=l 



k=l 



To avoid an oscillation in the missing value imputation, estimated values are 
used decreasingly. In contrast to their usage in the clustering process, their contribution 
degree in distance measurement during the estimation process is defined as, 



1 X^GXp 

J ll-t/T Xlj eX M 



(4.46) 
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fzPBI algorithm 
Steps 

1) t=0; initialize U t randomly w.r.t (3.2). 

2) Compute {d 2 (x i9 v k )}i=i.. n ,k=i..c as in (4.41) and (4.42). 

3 ) Compute U t+ 1 using (3.3). 

4) Compute V t+1 using (3.4). 

5) If (t > T) or (|j^ +1 - r m \\ < c) then Stop. 

6) Compute {d 2 (xi, Vk)}i=i.. n ,k=i..c using w- as in (4.46). 

7) Create a probabilistic data distribution model from the fuzzy partition using 
(4.24), (4.25) and (4.26). 

8) Create a probabilistic model for X M using (4.43) and (4.44). 

9) Estimate X M using (4.45). t=t+l. Go to Step 2. 

The difference between fzPBI and the standard FCM algorithm is that fzPBI 
provides a method to discover the probabilistic model of the data distribution from the 
fuzzy partition and to apply the model to missing value imputation. fzPBI therefore can 
address the problem of missing data itself. 

Algorithm performance measures 

We used two measures to evaluate algorithm performance. The first measure is 
the root mean square error (RMSE) between the true values and the imputed values, 
defined as: 
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RMSE= U^fc-xJ 



(4.47) 



where ni is the number of missing values imputed. The second measure assesses the 
overall performance determined by the number of data objects with missing attributes 
that were misclassified. This assessment is done by comparing the cluster label of each 
data object with its actual class label. If the two match, there is no misclassification. If 
they do not match, then a misclassification has occurred, defined in (3.21). 

Method testing 
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Figure 4-6 : Average RMSE of 50 trials using an incomplete ASET2 dataset with 
different percentages of missing values 

To evaluate the performance of fzPBI, we compared fzPBI with six popular 
imputation methods: PDS, OCS and NPS [140]; FCMimp [158]; CIAO [159]; and 
FCMGOimp [160]. For each test dataset, we generated the missing data using different 
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percentages of missing values. The value of the fuzzifier factor, m, was set to 2.0 for the 
artificial datasets and the Iris, Wine and RCNS datasets. For the Yeast, Yeast-MIPS and 
Serum datasets, m was set to 1.17 and 1.25 respectively, as in [60]. The number of 
clusters, c, was set to the known number of clusters. Each algorithm was run 5 times 
and the best result recorded. We repeated the experiment 50 times and averaged the 
performance of each algorithm using both measures. 
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Figure 4-7 : Average RMSE of 50 trials using an incomplete ASET5 dataset with 
different percentages of missing values 

Regarding the artificial datasets, ASET2 and ASET5; ASET2 has five well- 
separated clusters of similar size. ASET5 is more complex, containing three clusters 
that differ in size and density (Figure 3-16). On ASET2, fzPBI had the lowest RMSEs 
across different scales of missing data (Figure 4-6) and therefore performed better than 
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the other methods. On ASET5, fzPBI again performed best (Figure 4-7), although 
performance of the other methods was comparable. 

Applied to the Iris dataset, fzPBI and CIAO had the smallest RMSEs, although, 
as shown in Figure 4-8, fzPBI performed marginally better. Table 4.21 shows that, 
compared with CIAO, fzPBI had a smaller number of misclassified objects across 
different scales of missing values. 
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Figure 4-8 : Average RMSE of 50 trials using an incomplete Iris dataset with different 

percentages of missing values 

On the Wine dataset, on average, both fzPBI and CIAO outperformed the other 
methods. fzPBI and CIAO performed equally on data with percentages of missing 
values of 10% and lower. However, fzPBI performed slightly better than CIAO on data 
with higher percentages of missing values. FCMimp and FCMGOimp performed better 
than other methods, when the percentage of missing values was small (Figure 4-9). 
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However, they performed worse than all other methods on datasets with high 
percentages, i.e., greater than 15%, of missing values. 



Table 4.21 : Average results of 50 trials using an incomplete IRIS dataset with different 

percentages (%) of missing value 
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Figure 4-9 : Average RMSE of 50 trials using an incomplete Wine dataset with different 

percentages of missing values 
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Figure 4-10 : Average RMSE of 50 trials using an incomplete RCNS dataset with 
different percentages of missing values 

On the RCNS dataset (Figure 4-10), fzPBI outperformed other algorithms when 

the missing value percentage was 40% and lower. However, it performed marginally 

worse with a missing value percentage of 50%. This is because the RCNS dataset 

contains only 112 data points and becomes very sparse with high percentages of 

missing values; the data probability model was not properly determined, and fzDBI, 

therefore, could not perform correctly. 
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Figure 4-11: Average RMSE of 50 trials using an incomplete Yeast dataset with 
different percentages of missing values 

fzPBI outperformed other methods on the Yeast and Yeast-MIPS datasets 
(Figures 4-11 and 4-12); FCMGOimp was run using the GO term-based distance 
measure [11]. However, FCMGOimp outperformed only FCMimp. This result is similar 
to that reported in [158]. Using GO terms to measure the distance between genes is 
interesting. However, a crisp distance measure of {0,1}, where is the distance between 
a pair of genes having at least one GO term in common, does not help much with the 
problem of missing value imputation. 

fzPBI again outperformed other algorithms on Serum dataset with the 
percentage of missing value lower than 50% (Figure 4-13). For the rest of the cases, 
fzPBI was only outperformed by CIAO. On average, fzPBI performed best on the 
Serum dataset. 
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Figure 4-12 : Average RMSE of 50 trials using an incomplete Yeast-MIPS dataset with 

different percentages of missing values 
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4.2.7 Density based imputation method (fzDBI) [155] 

Similar to fzPBI, the fzDBI algorithm was developed to cluster a dataset X, that 
may be incomplete, into c clusters based on fuzzy partition generated using partial 
distance measure as in (4.41) and (4.42). The abstract fuzzy partition is derived from the 
FCM fuzzy partition using our fzSC method, as follows: 

Let = {U,V} be the fuzzy partition generated by FCM, which is a distance- 
based fuzzy partition describing the data density at the cluster centers. The accumulated 
density at the center of cluster Vk, k=l,...,c, is calculated as in (4.11). However, 
cannot properly describe the data density at every data point of X (4.2.2). A strong 
uniform fuzzy partition is then defined as in (4.15) and (4.16). The data density at every 

data point is then computed using (4.17). A set V, V = {v 1? v 2 ,...,v c } of the most 

dense data points, Cd, are then selected as cluster candidates. Because V represents the 
data densities in the dataset, we construct a density-based fuzzy partition© , = {U, V} , 
where Uis defined as: 

fi H =ZP(v,|v k )xui/2;P(v I |v k ). (4.48) 

1=1 / 1=1 

The estimated value of Xy e X M is computed based on both the FCM fuzzy 
partition, 0, and the density-based fuzzy partition, , 

x y =aJu k > kj /ju k n ; + (l-a)Xu k n :v kj /Xu k n :, (4.49) 

k=l / k=l k=l / k=l 
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where a, 0<a<l, indicates the contribution level of each of the two fuzzy partition 

models, and 9 , in the missing value imputation. We set a = 0.5 so that both the 
models contribute equally. To avoid an oscillation in the missing value imputation, 
estimated values are used decreasingly. In contrast with their usage in the clustering 
process, their contribution degree in distance measurement during the estimation 
process is defined as in (4.46). 

fzDBI algorithm 
Steps 

1) t = 0;c = Vn. 

2) Initialize U* randomly w.r.t (3.2). 

3) Compute {d 2 (x i9 v k )}i=i,...,n,k=i,...,c as in (4.41) and (4.42). 

4) Compute U t+1 using (3.3). 

5) Compute V t+1 using (3.4). 

6) If (t > T) or ( |j£i _ r m \\ < s ) then Stop. 

7) Compute {d 2 (x i9 v k )}i=i,..., n ,k=i,...,c using w ■ as in (4.46). 

8) Create a strong uniform fuzzy partition as in (4. 1 5) and (4. 1 6). 

9) Create a density-based fuzzy partition of X, 9 , as in (4.48). 

10) Estimate X M using (4.49). t=t+l. Go to Step 3. 

The difference between fzDBI and the standard FCM algorithm is that, in 
addition to a distance-based fuzzy partition, fzDBI generates a density-based fuzzy 
partition which is used with the distance-based one in missing value imputation. 

Method testing 

fzDBI was tested the same way with fzPBI, compared with the six popular 
imputation methods: PDS, OCS and NPS [140]; FCMimp [158]; CIAO [159]; and 
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FCMGOimp [160], fzDBI performed better on datasets of different levels (Figures 4-14 
-4-17). 
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Figure 4-14 : (fzDBI) Average RMSE of 50 trials using an incomplete ASET2 dataset 

with different missing value percentages 
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Figure 4-15 : (fzDBI) Average RMSE of 50 trials using an incomplete ASET5 dataset 

with different missing value percentages 
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Figure 4-16 : (fzDBI) Average RMSE of 50 trials using an incomplete Iris dataset with 

different percentages of missing values 
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Figure 4-17 : (fzDBI) Average RMSE of 50 trials using an incomplete Yeast-MIPS 
dataset with different percentages of missing values 
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4.2.8 Probability based defuzzification (fzPBD) [156] 

To generate classification information using fuzzy partition, the fuzzy partition 

needs to be defuzzified for crisp classification information of the data objects. A 
popular approach to this problem is to use the maximum membership degree principle 
(MMD). This may be inappropriate in some applications, particularly applications 
where the data are non-uniform, because FCM membership is computed using distance 
between the data object and cluster center. Use of membership degree can assign 
marginal objects of a large cluster to the immediately adjacent small cluster. Recent 
solutions include that of Chuang et al. [150], who proposed using spatial information to 
adjust the membership status of every data point using the membership status of its 
neighbors. In a similar approach, Chiu et al. [149] used spatial information to determine 
the distance between the data point and cluster center. However, these methods require 
definition of neighborhood boundaries. Genther et al. [151] proposed defining spatial 
information by the clustering structure, not the neighborhood, to compute the distance 
from the data point to cluster center. 

A common limitation of methods that use spatial information in computing 
membership degree is that they have to scale between the actual position information of 
the data point and its spatial information. In addition, while use of spatial information is 
appropriate for image segmentation, it may not work with generic data cluster analysis 
because it is difficult to define neighborhood boundaries. 
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We developed fzPBD that uses the fuzzy partition and the data themselves to 
construct a probabilistic model of the data distributions. The model is then applied to 
produce the classification information of data points in the dataset. 

Method design 

We applied the statistics model in our fzBLE method [141] to generate a 
probabilistic model of the data distribution using fuzzy partition. 

Given a fuzzy partition matrix U, the vector Uk = {uki}i=i,..., n , k=i,...,c, is a 
possibility model of the data distribution of Vk on X. We associate {Pk}, the probability 
distributions, with the possibility distributions {Uk} of Vk on X, k=l,...,c, using (4.22) 
and (4.23). The statistics at Vk, k=l,...,c, are then derived using (4.24), (4.25) and 
(4.26). 

Using the idea based on the Central Limit Theorem, "The distribution of an 
average tends to be normal, even when the distribution from which the average is 
computed is decidedly non-normal'' the data objects can be assumed to follow the 
normal distributions at {vk}k=i,...,c, defined as in (4.26). Therefore, the data object Xi is 
assigned to the class of Vk, where 

P(v k |x i ) = max{P(v 1 |x i )}. (4.50) 

1=1. .c 

Because P(v k |xO = P(xi,v k )/P(xi) = P(xi|v k )xP(v k )/P(xi), an alternative to (4.50) is 

P(v k |x i ) = max{P(x i | Vl )xP( Vl )}, (4.51) 

1=1. .c 

where P(vi), 1=1,. . .,c, the prior probability of vi, can be computed using (4.25). 
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fzPBD algorithm 
Steps 

1) Convert the possibility distributions in U into probability distributions using 
(4.22) and (4.23). 

2) Construct a probabilistic model of the data distributions using (4.24) and 
(4.26). 

3) Apply the model to produce the classification information for every data 
point using (4.51). 

Method testing 

We compared fzPBD with the MMD, FCB, NBM and sFCM methods on 
ASET2, ASET3, ASET4 and ASET5 using the compactness measure (3.43) and 
misclassification measure (3.21) for evaluation of algorithm performance. For each 
dataset, the following values of the fuzzifier factor, m were used: 1.25, 1.375, 1.50, 
1.625, 1.75, 1.875 and 2.0. The number of clusters, c, was set to the known number of 
clusters. The FCM algorithm was run 3 times and the best fuzzy cluster partition was 
selected to test all the algorithms. We repeated the experiment 100 times and averaged 
the performance of each algorithm using the two measures. 
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Figure 4-18 : Algorithm performance on ASET2 
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Figure 4-19 : Algorithm performance on ASET3 
ASET2 and ASET3 each contain five well-separated clusters of the same sizes. 
ASET1 and ASET2 are in two-dimensional and three-dimensional data space, 
respectively. Performance of all algorithms is shown in Figures 4-18 and 4-19. fzPBD, 
MMD and sFCM generated no misclassification across multiple levels of m. 
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Figure 4-20 : Algorithm performance on ASET4 
ASET4 contains 11 well-separated clusters in a five-dimensional data space. 
Results are shown in Figure 4-20. fzPBD and MMD performed best. Table 4.22 shows 
that they also had the smallest compactness measures across multiple levels of m. 

Table 4.22 : ASET4- Compactness measure 



Algo. 


Fuzzifier factor m 
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0.49 
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ASET5 is non-uniform with three clusters in a two-dimensional data space 
(Figure 3-16). Results are shown in Figure 4-21. fzPBD outperformed all the other 
algorithms. 










K__ 


--+-- 


__+ 


- -+- 


H 


- + 






25 - 


















+■» 
C 

o 

Q. 

ro 

+■» 
ro 

-a 
■a 
o 


20 - 

15 - 
















■ fzPBD 












^_ 


# 




-♦- 


♦— 


+"—' 
















in 


















— ♦ -MMD 


in 

E 


10 - 
















•••X--- FCB 

NBM 
















.Q 


















sFCM 


3 

z 


S 


















(.... 


yd 


$Kr • • • • 


*• 


—JK— 


-x- 


••••X 




I 


































1.25 


1.375 


1.5 


1.625 
Fuzzifier (m) 


1.75 


1.875 


2 





Figure 4-21 : Algorithm performance on ASET5 
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Figure 4-22 : Algorithm performance on IRIS 
The IRIS dataset contains three clusters corresponding to the three classes of Iris 
flowers. Results are shown in Figure 4-22. NBM outperformed other algorithms. fzPBD 
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performed lightly less well, however, Table 4.23 shows that, compared with NBM, 
fzPBD had smaller compactness measures across multiple levels of m. 

Table 4.23 : IRIS: Compactness measure 



Algo. 






Fuzzifier factor 


m 
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Figure 4-23 : Algorithm performance on WINE dataset 
The Wine dataset contains information on 13 attributes of the three classes of 
wines. Results are shown in Figure 4-23. fzPBD and FCB performed better than other 
algorithms. fzPBD performed better than FCB with the fuzzifier factor levels less than 
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1.8. Also, Table 4.24 shows that fzPBD performed better than FCB across multiple 
levels of m. 

Table 4.24 : WINE- Compactness measure 



Algo. 






Fuzzifier factor m 
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Figure 4-24 : Algorithm performance on GLASS dataset 
The Glass dataset contains information on nine attributes of six classes of glass 
used in building construction. Results are shown in Figure 4-24. fzPBD significantly 
outperformed all the other algorithms. fzPBD outperformed the other methods on both 
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artificial and real datasets, particularly on the datasets with clusters that differed in size. 
fzPBD is, therefore, appropriate for real-world datasets, where usually the data densities 
are not uniformly distributed. 

4.2.9 Fuzzy genetic subtractive clustering method (fzGASCE) [157] 

We have addressed the key problems of the fuzzy C-Means algorithm: 

initialization of the fuzzy partition matrix, evaluation of fuzzy clustering results, 
missing value imputation and fuzzy partition defuzzification. The experimental results 
show that our methods outperformed the existing algorithms addressing the same issues. 
Yet another issue with FCM algorithm is that, like most other partitioning approaches, it 
cannot by itself determine the number of clusters, and, as in previous studies, its results 
depend strongly on initial parameters. For some initial values, FCM will converge 
rapidly to a global optimum, but, for others, it may become stuck in a local optimum. 
Common approach to address the limitations of FCM is to integrate FCM with the 
Genetic Algorithm (GA), where the GA is used to manage a set of solution candidates. 
The FCM algorithm is then applied and a cluster validity index is used as the GA fitness 
function to search for the best solution. Ghosh et al. [163] and Liu et al. [166] proposed 
to use the partition coefficient (PC) [92] and Xie and Beni (XB) [94] validity indices. In 
addition, Liu et al. [166] proposed a modified version of the PC index (MPC) in order to 
reduce the monotonic tendency of the index. The Fukuyama-Sugeno cluster index (FS) 
[93], which measures the compactness and separation of the cluster partition, was used 
by Ghosh et al. [163]. Lianjiang et al. [164], in a novel self-Adaptive Genetic Fuzzy C- 
Means algorithm (AGFCM), proposed a validity index combining the PC index with 
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total variation of fuzzy partition. Haider, Pramanik and Kar [162] proposed a GA fitness 
function (HPK) based on the compactness measure and combined it with an intra-inter 
validity index for a novel algorithm that automatically determines the number of 
clusters in the dataset. Lin et al. [165] proposed a combination of GA and FCM for a 
novel method with adaptive cluster validity index (ACVI) based on the intra and inter 
measures of the fuzzy partition, where GA is used to solve the trade-off problem 
between these two factors for a better evaluation of the fuzzy partition. 

A common limitation of existing methods using GA with FCM is that the GA 
fitness functions are based on the cluster validity indices, which usually have a problem 
with scaling between the compactness and separation factors. In addition, they use the 
maximum membership degree for defuzzification that may be improper, as described in 
Section (4.2.8). In this section, we combine FCM with GA and fuzzy SC algorithms for 
a novel clustering algorithm, fzGASCE, that automatically determines the number of 
clusters in the dataset. The FCM algorithm rapidly determines the exact clustering 
prototypes of each solution candidate so that the GA algorithm, managing a set of such 
candidates, can select the optimal one. The fuzzy SC method helps the GA algorithm 
escape any local optima [157]. 

The fuzzy subtractive clustering (fzSC) method has advantages over the 
traditional one in that it is more efficient from the computational viewpoint. In addition, 
it is more appropriate for real world problems, because it does not require additional 
information regarding the mountain peak, mountain radii, or the threshold between the 
current density and original density for a data object to be selected as a cluster 
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candidate. However, fzSC uses a random fuzzy partition, and may perform worse in 
cases where the partition is not the result of a convergent fuzzy partitioning process. For 
the artificial dataset, the FCM algorithm can converge very rapidly, but for some real 
datasets, because of overlap regions where a data point can belong to two or more 
clusters, the FCM algorithm may not converge to a local optimum. The fuzzy strong 
uniform criteria may not hold with the new fuzzy partition created from the fuzzy 
partition by the FCM algorithm. Hence, the cluster candidates determined by our fzSC 
may not be optimal. In addition, a local optimal fuzzy partition may cause the fzBLE 
method to improperly pick the best partition from a given set of fuzzy partitions. 

Method design 

Chromosome 

We used chromosome to represent the whole clustering solution. Each 
chromosome contains a set of loci each standing for the index of the data point selected 
as cluster center. We set the length of chromosomes to Vn, which is assumed the 
maximum number of clusters in the dataset. 

Crossover operator 

The crossover operator is used to produce two new offspring from a given pair 
of parents. Both the roulette wheel and tournament selection methods are used 
interchangeably to select parents maintaining potentially useful solutions in the current 
generation. A two-point crossover operator with probability P c , P c = 0.5, is used. 

Mutation operator 
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The mutation operator is used to make changes in portions of the chromosomes 
of newly created members. Because each chromosome encodes data points representing 
the cluster centers of a clustering solution, changing of the data points in the 
chromosome may improve the clustering quality. We, therefore, propose three different 
tasks for the mutation operator: (i) add a data point as a new cluster center, because this 
may help to locate a new cluster in a higher density region, (ii) remove a data point to 
prevent the inclusion of a sparse cluster, and (iii) replace one data point with another so 
that the new cluster is located in a higher density region. These tasks are commonly 
used by existing methods. However, they are employed in a random or heuristic way, 
and cannot guarantee that the GA algorithm will escape local optima. To address this 
issue, we applied our fzSC method. The fuzzy partitions of the parents are used to 
estimate the density at every data point using (4.13), (4.14), (4.15), (4.16) and (4.17). At 
each time, t, the data point, x* with the highest density, M*, is selected, and the 

densities at the remaining data points are updated as in (4.19). Such points, where the 
densities change less than a predefined ratio, R M , are considered to be significantly 
dense and are used in the mutation operator of fzGASCE. This is instead of using 
randomly selected data points, as in existing methods. We chose a value of 0.95 for R M . 
The selection of the value of R M does not affect the outcome; however, a low value of 
Rm may slow the convergence process of fzGASCE. 

Fitness function 

Instead of using a cluster validity index for the fitness function, we use our 
cluster validity method, fzBLE, for cluster evaluation. For each chromosome, FCM is 
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applied to generate the fuzzy partition, which is then used to generate a probabilistic 
model of the data distributions as in (4.25), (4.26) and (4.27). The fitness function, 
given a fuzzy partition 0, = {U,V}, is defined: 

fitness(0) = log[L(0|X)] - log(c), (4.52) 

where L(0|X), the likelihood of the clustering model and the data, is measured as: 

L(0|X) = L(U,V|X) 

n n c (4 53^) 

=np(x 1 iu,v)=nzp(v k )xp( Xl iv k >. 

i=l i=l k=l 

fzGASCE algorithm 

Input: data to cluster X={xi}, i=l,. . .,n. 
Output: an optimal fuzzy clustering solution, 

c: optimal number of clusters. 

V = {vi }, i =l,...,c: the cluster centers. 

U= {uki} , i= 1 , . . . ,n, k= 1 , . . . ,c : the partition matrix. 
Steps 

1) Randomly generate a set of chromosomes describing clustering solution 
candidates. 

2) Compute the fitness value for every chromosome. 

3) If the stop criteria are met, then go to Step 7. 

4) Apply the crossover operator with the probability P c = 0.5, and the roulette 
wheel and tournament parent selection methods. 
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5) Apply the mutation operator with the probability P m = 0.01 . The significantly 
dense data points are used in the replacement, and fresh offspring are created 
using these points with a probability of P c xP m . 

6) Go to Step 2. 

7) Select the 'best' chromosome from the population as the clustering solution. 

8) Apply [157] for the defuzzification of the fuzzy partition of the solution. 

Method testing 

To evaluate the performance of fzGASCE, we used four artificial datasets 
ASET2, ASET3 and ASET4 each have well-separated clusters of similar sizes. The 
number of clusters and data dimensions of these dataset are (5,2), (5,3) and (11,5), 
respectively. ASET5 is more complex, containing three clusters that differ in size and 
density (Figure 3-16). For real datasets, we used the Iris and Wine datasets. 

Performance measures 

We used three measures, COR, EVAR and MISC, to evaluate algorithm 
performance. COR is the correctness ratio, defined as, 

COR = ^|>-c), ( 4 - 54 ) 

where N is the number of trials, c and c are the number of clusters and the predicted 
number of clusters respectively, and I(.) is defined as: 

[ l,x = 

I(x) = 

0,x*0 
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EVAR is a measure of the accuracy of the predicted number of clusters defined as in 
(4.55). 

EVAR = ^V( C " £ ) 2 - ( 4 - 55 ) 

MISC is a measure of the overall performance, determined by the number of data 
objects that were misclassified, defined as in (3.21). To be fair, MISC is calculated only 
when an algorithm correctly identifies the number of clusters. Then, the assigned cluster 
label of each object is compared with its actual cluster label. If any of them do not 
match, a misclassification has occurred. 

We compared the performance of fzGASCE with those of the eight genetic 
algorithm methods that also use FCM, specifically, PBMF, MPC, HPK, AGFCM, XB, 
FS, PC and ACVI [162-166]. An earlier version of fzGASCE, fzGAE, which did not 
include the fuzzy SC and defuzzification methods, was also used. For each dataset, the 
number of clusters, c, was set to the known number of clusters. All algorithms were run 
using a population size of 24 and a maximum number of 100 generations. The fuzzy 
partition of each chromosome was generated using the FCM algorithm with 10 
iterations, and the fuzzifier factor, m, was set to 1.25. We repeated the experiment 100 
times and averaged the performance of each algorithm using values of COR, EVAR and 
MISC. 

ASET2 contains five clusters in a 2-dimensional data space. The clusters are well- 
separated and of the same size. Performance of all algorithms is shown in Table 4.25. 
All algorithms had very low MISC measures, indicating that they grouped the data 
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points into the correct clusters. However, fzGASCE outperformed all algorithms by all 
three measures and fzGAE performed better than the other methods with the exception 
of fzGASCE. This comparison illustrates the advantage of using our fzBLE method 
[147] in the GA fitness function. The use of fuzzy SC [146] in fzGASCE also improved 
its performance, particularly in escaping local optima. 

Table 4.25 : Performance of GA algorithms on ASET2 



Algorithm 


COR 


EVAR 


MISC 


fzGASCE 


1.000 


0.000 


0.000 


fzGAE 


0.640 


0.500 


0.000 


PBMF 


0.510 


0.590 


0.000 


MPC 


0.290 


0.970 


0.000 


HPK 


0.100 


5.010 


0.021 


AGFCM 


0.600 


2.800 


0.000 


XB 


0.490 


1.450 


0.000 


FS 


0.120 


1.100 


0.070 


PC 


0.230 


1.040 


0.000 


ACVI 


0.200 


2.490 


0.011 



The ASET3 contains five well-separated clusters in a 3 -dimensional data space 
while the ASET4 contains 11 clusters in a 5-dimensional data space. Performance of all 
algorithms on ASET3 and ASET4 are shown in Tables 4.26 and 4.27 respectively. On 
both datasets, fzGASCE outperformed all other algorithms, while fzGAE was the 
second best. 
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Table 4.26 : Performance of GA algorithms on ASET3 



Algorithm 


COR 


EVAR 


MISC 


fzGASCE 


1.000 


0.000 


0.000 


fzGAE 


0.710 


0.380 


0.000 


PBMF 


0.600 


0.450 


0.000 


MPC 


0.610 


0.860 


0.000 


HPK 


0.120 


5.240 


0.000 


AGFCM 


0.650 


1.490 


0.000 


XB 


0.640 


0.430 


0.000 


FS 


0.520 


0.840 


0.011 


PC 


0.620 


0.890 


0.000 


ACVI 


0.100 


2.100 


0.000 



Table 4.27 : Performance of GA algorithm on ASET4 



Algorithm 


COR 


EVAR 


MISC 


fzGASCE 


1.000 


0.000 


0.000 


fzGAE 


0.450 


0.750 


0.000 


PBMF 


0.340 


1.000 


0.000 


MPC 


0.420 


0.820 


0.002 


HPK 


0.010 


1.910 


0.037 


AGFCM 


0.340 


2.380 


0.000 


XB 


0.410 


0.900 


0.000 


FS 


0.400 


0.880 


0.000 


PC 


0.450 


0.700 


0.000 


ACVI 


0.170 


4.650 


0.000 
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The ASET5 dataset is non-uniform with three clusters in a 2-dimensional data 
space. Table 4.28 shows the algorithm performance. HPK and AGFCM both failed to 
determine the number of clusters. fzGASCE not only outperformed the other 
algorithms, with COR =1.0, but also was the only algorithm that grouped all of the data 
points into the correct clusters. Although fzGAE performed better than the remaining 
algorithms, it failed to correctly group data points into clusters, similar to other methods 
that used the maximum membership degree for defuzzification. 

Table 4.28 : Performance of GA algorithms on ASET5 



Algorithm 


COR 


EVAR 


MISC 


fzGASCE 


1.000 


0.000 


0.000 


fzGAE 


0.900 


0.100 


0.107 


PBMF 


0.700 


0.300 


0.107 


MPC 


0.050 


0.960 


0.107 


HPK 


0.000 


5.770 


- 


AGFCM 


0.000 


8.470 


- 


XB 


0.040 


0.960 


0.107 


FS 


0.020 


3.480 


0.107 


PC 


0.050 


0.960 


0.107 


ACVI 


0.080 


0.920 


0.107 



The IRIS dataset contains three clusters corresponding to the three classes of Iris 
flowers. The performance of the algorithms on this dataset is shown in Table 4.29. HPK 
and AGFCM again completely failed at detecting the number of clusters. fzGASCE 
outperformed other algorithms in detecting the number of clusters as well as in 
grouping data points into their own clusters. 
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Table 4.29 : Performance of GA algorithms on IRIS 



Algorithm 


COR 


EVAR 


MISC 


fzGASCE 


1.000 


0.000 


0.033 


fzGAE 


0.880 


0.120 


0.040 


PBMF 


0.860 


0.140 


0.040 


MPC 


0.040 


0.970 


0.160 


HPK 


0.000 


5.720 


- 


AGFCM 


0.000 


8.120 


- 


XB 


0.050 


1.010 


0.040 


FS 


0.390 


0.780 


0.154 


PC 


0.080 


0.920 


0.115 


ACVI 


0.150 


0.850 


0.040 


Table 4.30 : Performanc of GA algorithms 


; on WINE 


Algorithm 


COR 


EVAR 


MISC 


fzGASCE 


1.000 


0.000 


0.213 


fzGAE 


0.860 


0.140 


0.303 


PBMF 


0.000 


2.050 


- 


MPC 


0.000 


2.810 


- 


HPK 


0.000 


6.760 


- 


AGFCM 


0.000 


9.210 


- 


XB 


0.270 


1.010 


0.303 


FS 


0.000 


5.720 


- 


PC 


0.110 


0.920 


0.303 


ACVI 


0.090 


0.910 


0.303 
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The Wine dataset contains information on 13 attributes of three classes of wines. 
Results on this dataset are shown in Table 4.30. Only fzGASCE and fzGAE identified 
the correct number of clusters, with COR values of 1 and 0.86, respectively. Among the 
other algorithms, only XB and ACVI detected the correct number of clusters, but only 
with low COR values. Overall, fzGASCE outperformed other methods on both artificial 
and real datasets. It performed particularly well on datasets with clusters that differed in 
size, not only in predicting the correct number of clusters, but also in grouping the data 
points into the correct clusters. fzGASCE is, therefore, appropriate for real-world 
problems, where the data densities are not uniformly distributed. 

4.3 Motif finding problem 

4.3.1 HIGEDA algorithm 

We present a new algorithm, HIGEDA [145], applicable to either DNA or 

protein sequences, which uses the TCM model and combines the hierarchical gene-set 
genetics algorithm, HGA (Hong, 2008) with EM and Dynamic Programming (DP) 
algorithms to find motifs with gaps. The HGA algorithm helps HIGEDA manage motif 
seeds and escape from local optima; the EM algorithm uses the best alignment of a 
motif model on the dataset, where the alignments are generated by DP to estimate the 
motif model parameters so that the model fits the best conserved forms of the motifs of 
interest. 

Method design 
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Given A = {ai, a2,..., ai}u{_}, a set of symbols used for sequence encoding and 
the gap symbol, 1=20 for protein sequences and 1=4 for DNA sequences, let S = {Si, 
S2,.,., S n } be a set of biological sequences based on A. Assume L = {Li, L 2 ,..., L n }, 
where Li is the length of sequence i, i=l,. . .,n. Assume that the length of shared motifs is 
a known value, W. 

With mi = Li-W+1, the number of possible positions of a motif in sequence i, 
denote Z = {zy}, i=l,...,n, j=l,...,mi, as the probability of motif occurrence at position j 
in sequence i. 

Let be the motif model. 

The motif finding problem is to determine and Z such that: 

P(S,Z|0)->max. (4.56) 

Motif model 

Of the motif model 0, two components M and B model the motif and non- 
motif (the background) positions in sequences. A motif is modeled by a sequence of 
discrete random variables whose values give the probabilities of each of the different 
letters occurring in each of the different positions of a motif occurrence. The 
background positions in the sequences are modeled by a single discrete random 
variable. The motif model is as follows: 
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(4.57) 



where, a ,k, l<k<W, is the probability that symbol a occurs either at a background 
position or at position k of a motif occurrence. 



5X k =i,vk=i,...,w. 



(4.58) 



asA 



The matrix in (4.57) is a position weight matrix (PWM), but its use differs 
from conventional PWMs in that it contains 21 rows for protein motifs and 5 rows for 
DNA motifs. The first row stands for the gap symbol, which may occur in a motif, but 
not in the first or last positions. The remaining rows stand for residue symbols, 20 
amino acid letters or 4 nucleotide letters. W is chosen large enough to accommodate all 
possible motif consensus sequences. Figure 3-7 shows a motif model representing 
'AC, 'AGC and 'ATC\ 

Sequence and motif model alignment with gaps 

Given an input subsequence s, the best alignment with gaps of s and is created 
using DP algorithm. Because s comes from a real dataset, gaps are allowed only in 0. 
The first symbol of s is aligned with the first column in 0. Consecutive symbols from s 
are aligned with either gap or symbol columns to achieve the best alignment score. A 
conventional dynamic alignment score is the sum of the pair wise alignment scores. 
Instead, here, it is the multiplication of all pair wise alignment scores, and hence it is the 
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probability that the best consensus from matches s. To control the occurrence of gaps, 
we define PPM, POG and PEG as the reward for a perfect match and the penalties for 
opening and extending a gap, respectively. We choose, by empirical experiments, 
PPM=1, POG=0.00875, and PEG=0.325. Let U jk be the best alignment score up to the 
j th symbol Sj in s and column k of©. Ujk is calculated as in (4.59). 



U 



jk 



u j _ uk _ 1 *pp M *e s %,k=j 
u, k _ 1 *PG*(i-e s ^ i;k+1 ),j=o,k>j 

[maxdJ^ *PPM*O s %,U j;k _ 1 *PG*(l-^ +i , k+1 )} 



(4.59) 



where PG is either PEG or POG depending on the gap status at column k in U. Let 
s='ACG' and be as shown in Figure 4-25. Figure 4-26 shows the dynamic alignment 
matrix {Uy} of s w.r.t 0. The best alignment score is 0.0039 which is found in the 
second row and the last column of the matrix. Hence, the best consensus for s=' ACG' 
by is 'AC'. 'AC is considered in-motif model and the last symbol 'G' is out-motif 
model for subsequence s. The motif model will be adjusted to fit 'AC' instead of 
'ACG'. This feature produces the occurrence probabilities for gap symbols in our motif 
model even if such a symbol does not appear in the sequences set. 
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Figure 4-25 : A motif model 
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Figure 4-26 : Dynamic alignment of s='ACG' w.r.t © from Figure 4-25 

Motif occurrence model 

Let S 1J be the subsequence of length W at position j in sequence i. Denote I(.) as 
the indicator function and Pm(S 1j ) and Pb(S 1j ) as the conditional probabilities that S 1J is 
generated using the motif model and the background model respectively. 



W L 



p M (s') = p(s ij i © m ) = nnfc) I<s " )tk " a) 



k=l a=l 



W L 



p B (s ij )=p B (s ij ie B )=nn( e aor" ,+ 



=a) 



(4.60) 



(4.61) 



k=l a=l 



Let X be the prior probability of motif occurrence at every possible position in 
the sequence set. Similarly to the TCM model in MEME [70], the motif occurrence 
probability at position j in sequence i is 



z ij = " 



^*P M (S 1J ) 



X*P M (S 1J ) + (1-^)*P B (S 1J ) 
The motif log-odds score of S 1J is defined by: 



los(S 1J ) = log(P M (S 1J )/P B (S 1J )). 



S J is considered a motif hit if it satisfies the inequality: 



los(S 1J )>log[(l-X)A]. 



(4.62) 



(4.63) 



(4.64) 
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Gapped motif model 

When gaps are allowed in the alignment of S 1J and 0, Pm(S 1j ) and Pb(S 1j ) are 
calculated using the recovered version with gaps of S 1J , similar to that in Xie [178]. Let 
s m and s c be the in-motif and out-motif parts, respectively. Let s be the best consensus 
for S ij by 0. 

P M (S 1J ) = P B (s c )*P M (sO = P(s c |Z 9 B )*P(s*|Z 9 M ), (4.65) 

PB(S 1J ) = PB(s c )*P B (s m ) = P(s c |Z ? B )*P(s m |Z 9 B ). (4.66) 

Multiple-motif finding method 

To find multiple motifs in given dataset, we use a mask variable M, 
M={Mij}i=i j ... >n j=i v . Mm , where My represents the probability of the chance that a motif 
occurs again at position j in sequence i. M is initially set to {1}. The modified version 
of (4.62) with respect to M is 

z™ = P( Zij = 1 1 M) = z, * mm{M 1J+H } . (4.67) 

Once a motif is found, its positions are updated to M. 

M i(j+k) =M i(j+k) *(l-AM i(j+k) ), (4.68) 

where AM i(j+k) = P(zij=l)*(W-k)/W, k=0,...,W-l. The update mechanism in (4.68) 
allows for multiple overlapping motifs. 

Hierarchical gene-set genetics algorithm (HGA) 

The GA is a global optimization procedure that performs adaptive searches to 
find solutions to large scale optimization problems with multiple local optima. 
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Conventional GAs use crossover and mutation operators to escape local optima. These 
operators depend strongly on how the probabilities of crossover and mutation are 
chosen. Recent improvements in GA have focused on adaptively adjusting operator 
probabilities so that the genetics processes quickly escape local optima. However, 
setting up adaptive GAs is difficult and most approaches are based on heuristics. 

Here we use HGA, a GA improved by Hong [75]. HGA treats a chromosome as 
a set of gene-sets, not a set of genes as in conventional GAs, as a mechanism to escape 
local optima. Starting with gene-sets of the largest size equal to half the chromosome 
length and ending with gene sets of size 1, HGA performs crossover and mutation 
operations based on gene-set boundaries. When the model is (8, k)-convergent, HGA 
expects to find a global optimum and attempts to escape local optima, if any exist, by 
performing genetics operations with the largest size gene-sets. HGA is most appropriate 
to our genetics model, because each gene-set represents a set of adjacent columns in the 
motif model that represents patterns of adjacent residues in biological sequences. 
Genetics operations based on gene-set boundaries allow residues to come together as 
well as to change simultaneously. 

EM algorithm ofHIGEDA 

HIGEDA solves the motif finding problem using an EM algorithm to maximize 
the likelihood function (4.69) over the entire dataset. 

L(S,Z | 0) = logP(S,Z | 0) -> max . (4.69) 

Estimation step: Z is estimated using (4.63) (or (4.67)). 
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,(t+l) _ 



^*P M (S iJ ) 
^ t *P M (S 1J ) + (l-X t )*P B (S 1J ) 



(4.70) 



Maximization step: and X are computed with respect to (4.70), 



L(S,Z|0) = X|;(l-z ij )logP B (S ij ) + z ij logP M (S ij ) 

i=l 1=1 



Such that, 



+ (1 - z { . ) log(l -X) + z {] log(X) -> max . 



(4.71) 



dL 1 ^ 

3a nmtFj^ 



(4.72) 



With respect to constraints (4.58), the objective function is relaxed using Lagrange 
multipliers yk, k=0,. . .,W, 



w ( x \ 

L = L(s,z|0)+Xy k Ze ak -i . 

k=0 V a=l y 

aio g p M (s ij ) i ap M (s ij ) 



50 



ak 



p m (s ij ) ae a M k 



p m (s 1j ) 



W L 



nnfc) 1 



J(S i , +k ._ 1 =a') 



k'=l a'=l 
^kVk 



I(S lJ+k _, = a) = ' " J+ 



I(S,H. k -i=a) 







ak 



Hence, 
8L 



^ I(S ij+k -!=a) 



<? W ak i=l j=l 

So, 



ak 



ii in 



M _ i=l j=l 



er k = - 



(4.73) 
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Using (4.58) and (4.73) 

1 n m 

SZE^^S^^a) 



M a=l i=l j=l 



IX =- 



a=l Yk 

n m 1 n m 

_ i=l j=l a=l i=l j=l ^ 



(4.74) 



Tk 



Using (4.73) and (4.74), 



ZZ* H = a ) 



^ = - H n m • (4-75) 



i=l j=l 



z ij 



Similarly, 



n m W 

ZZ( 1 - z «)Z I ( s u*-i= a ) 



0?o = ^ '" H ■ m M • (4-76) 

i=i j=i 

Refining model parameters 

The maximization step results in model parameters © and X. Because the values 
of Z are estimated during the EM processes, a straightforward update to © and X using 
(4.70), (4.75) and (4.76) may cause an oscillation in convergence. MEME estimates X 
by trying values from \l\m4n J to 1/(W+1). This requires significant computational time 
and squanders the maximization step benefit. We apply the gradient descent learning 
law to update model parameter, £,, of the model, 

4(t+l) = $(t) + Mt x A^, (4.77) 
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where AS, = £, e (t+l) - £,(t) and £,(t), £, e (t+l) and £,(t+l) are the current, estimated and new 
values of £,, respectively. The learning rate jLX t may be slightly reduced by processing 
time. A popular form of jLX t is as in (4.78), where T is the processing duration. 



^^(i-X.Ixt)- (4 - 78 > 



If the size of the dataset is small, some elements of© may be zero. These are not 
appropriate to the Bayesian process and remain zero. To solve this problem, MEME 
uses Dirichlet mixture model [34]. While this has a strong mathematical basis, the 
drawback lies in how to select the right number of mixture components. In addition, this 
number has no meaning in sequence analysis. Instead, we use pseudo-count methods. 
For DNA, we borrow an added pseudo-counts method from Henikoff and Henikoff 
[74]. The added portion used is 0.01. For proteins, we propose a method using the motif 
model and the substitution probability matrix from BLOSUM62, and a heuristic that the 
pseudo-counts should be position specific and depend on strong signals [145]. The 
pseudo-count value for a given symbol a, in column k of © is calculated using Equation 
(4.79). 

P SC a k =Z l P a/b> (4.79) 

b=l 

where P a /b is the BLOSUM substitution probability for amino acid a from the 
observation of amino acid b. The motif model is then refined using (4.80) in which, a 
and P are predefined, and a+P = 1 . 

©'a,k = a.0 a ,k+p.psc ak . (4.80) 
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HIGEDA Algorithm 

HIGEDA uses HGA to manage its population and to escape local optima. 
During the evolution process, members of the current generation are processed with the 
EM algorithm a small number of times, and ranked based on their goodness, measured 
using a fitness function. The best members are used to create new ones using crossover 
and mutation operators. Newly created members replace the worst ones in the current 
generation using a tournament selection to form the new generation. This process 
repeats until the number of generations is equal to a predefined number, or the current 
generation goodness ratio is convergent. The best members from the last generation are 
taken as possible motif candidates. 

Each member contains a variable X and a chromosome encoding 0, described in 
(4.57); each gene in the chromosome represents a column in © that has 21 elements for 
a protein motif, or five elements for a DNA motif. There are W+l such genes in each 
chromosome. We note that while Bi [71] proposed encoding PWM using chromosomes, 
he did not discuss it in detail. Because the length W of a motif is small relative to the 
sequence lengths, our genetics model consumes less memory than those of Wei and 
Jensen [83] or Bi [71]. 

The gene-set maximum size is lo = W/2, the initial size is 1 and the final size is 1, such 
that, 

1 = 2q, where 2q < 1 < 2q+l. (4.81) 

We use 8=0.05 and k=9 for (8, k)-convergent criterion. 
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The crossover operator is used to produce two new children from a given pair of 
parents. We use a two-point gene-set crossover operator with probability p c = 0.95. 

The mutation operator is used to make changes in some portions of the chromosomes of 
newly created members. Because each chromosome encodes a © resulting from an 
alignment on the whole sequence set, a one position shift to the left or right may 
improve the quality of the alignment. We propose two different tasks for the mutation 
operator: (i) change in a gene-set: Two rows are selected randomly. Cells corresponding 
to the selected rows in the given gene-set are exchanged, and (ii) change to whole 
chromosome: Gene-sets in the chromosome are shifted left or right by one position. The 
blank gene-sets are filled with average probability values. 

The fitness function of HIGEDA is a combination of the objective function (4.69) and 
the posterior scoring function (4.80). The objective function assesses the best fit model, 
while the posterior scoring function determines how well the model is distinguished 
from the background in the entire dataset. We use the motif posterior scoring approach 
of Nowakowski-Tiuryn [79] and estimate the prior probability of motif occurrence. It 
follows that the posterior score of subsequence S 1J at position j in sequence i has the 
form: 



, pv J 1J |Q^) = log XxPm( S ' J) 
(l-^)xP B (S 1J ) 



S p (S 1J |0A) = k)g „ ^ / (4.82) 



(4.69) and (4.80) are normalized and combined. Both are important to the model 
evaluation, but not equally in all contexts. The fitness function is first applied to find 
best fit models, and then used with the posterior scoring function to select the most 
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significant model. To this rule, we add a scaling function v(t), which is shown in Figure 
4-27. The fitness function of HIGEDA is then defined as: 



v 
fit(0) = ^-L(S,Z|0) + (v M -v)££ 



=1 .1=1 



P(Zjj=l) 

|P(Zij=l)| 



:S p (S 1J |©A). (4.83) 



v M 



Figure 4-27 : v(t) = Vm x T / (T + t 2 ). Vm is the maximum value of v. 

To find gapped motifs, in later phases, HIGEDA tries to improve the most 
significant model by introducing gaps. This is similar to a local alignment problem, but 
instead of aligning every sequence to the sequence set, HIGEDA, using DP, aligns the 
sequence to the motif model that describes the local alignment of the sequence set. 
While the run time of gapless alignments is O(W), that of DP alignment is 0(W 2 ). By 
restricting the use of DP, the speed of HIGEDA is significantly improved. 
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HIGEDA Algorithm [145] 

• Input: Sequence set S, motif length W. 

• Output: Motif model 0. 
Steps 

1) Set M={1} once, set gene-set parameters using (4.71), and randomly 
generate the first generation, set gNo=0. 

2) For each member in the current generation, apply EM once. Measure 
member goodness using the fitness function (4.73). 

3) Select parents from the best members using ranking selection; apply genetics 
operator to create new members. 

4) Create new generation using tournament selection, gNo++. 

5) If the current generation is (8, k)-convergent, then 

6) save ( 1 ), set 1=10, 

7) Apply genetics operators to refresh current generation. 

8) Restore ( 1 ), go to Step 2. 

9) If stop criteria are met, then 

10) set 1 = 1/2, set gNo = 0. 

11) Ifl>0,thengotoStep2. 

12) Take the best member as the motif candidate; run EM with the candidate 
until convergent to obtain the motif model. 

13) Output the motif model and update M using (4.68). 

14) Stop, or go to Step 1 to find more motifs. 

Method testing 

We compared HIGEDA with five open source algorithms, GAME-2006, MEME 
v3.5, GLAM2-2008, BioProspector-2004 and PRATT v2.1, one web application, 
GEMFA (http://gemfa.cmh.edu), and for the EM motif algorithm, we used results from 
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Bi [71]. All algorithms were run 20 times on each dataset with the runtimes recorded; 
for GEMFA, this was obtained from the website. Default run parameters were used for 
all programs. For HIGEDA, the number of generations at each gene-set level was 50 
and the population size for all gene-set levels was 60. For algorithms in Tables 4.31, 
4.32 and 4.33, motif lengths were known motif lengths. For Tables 4.34 and 4.35, we 
first ran HIGEDA to find statistically significant motifs, and then ran other algorithms 
using the lengths of best motifs found by HIGEDA. 

Performance measures 

As in Bi [71], two quantities are used to evaluate the algorithms: LPC, the letter 
level performance coefficient and SPC, the motif site level performance coefficient. If 
we denote delta(.) as the indicator function, and Oi and Ai, respectively, as the set of 
known and predicted motif positions in sequence i, then, 

LPQS) = - Y \A { nOj/lA. uOj , (4.84) 

ntT 

SPC(S) = -V delta(A 1 nO^ Empty) . (4.85) 

Simulated DNA datasets 

We generated simulated DNA datasets as in Bi [71], with three different 
background base compositions: (a) uniform, where A, T, C, G occur with equal 
frequency, (b) AT-rich (AT = 60%), and (c) CG-rich (CG = 60%). The motif string, 
GTCACGCCGATATTG, was merged once (or twice, if needed) into each sequence, 
after a defined level of the string change: (i) 9% change representing limited divergence 
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(i.e. 91% of symbols are identical to the original string), (ii) 21% change, or (iii) 30% 
change (essentially background or random sequence variation). 



Table 4.3 1 : Average performance (LPC/SPC) on simulated DNA datasets 
Motif identity Algorithm Uniform AT-Rich CG-Rich A vg. Runtime 



91% 



HIGEDA 



1.00/1.00 



1.00/1.00 1.00/1.00 



GEMFA 



0.98/1.00 0.98/1.00 1.00/1.00 



GAME 



0.86/0.88 



0.88/0.90 0.91/0.94 



MEME 



1.00/1.00 



1.00/1.00 1.00/1.00 



GLAM2 



1.00/1.00 



1.00/1.00 1.00/1.00 



BioPro. 



0.99/1.00 0.99/1.00 0.94/0.95 



PRATT 



0.83/0.95 



0.88/1.00 0.46/0.80 



EM 



0.99/1.00 0.99/1.00 1.00/1.00 



32s 



22s 



2m26s 



2s 



52s 



2s 



0.2s 



79% 



HIGEDA 



0.87/0.96 



1.00/1.00 1.00/1.00 



GEMFA 



0.87/0.88 



0.87/0.90 0.85/0.89 



GAME 



0.43/0.55 



0.55/0.61 0.64/0.71 



MEME 



0.95/0.95 



1.00/1.00 1.00/1.00 



GLAM2 



1.00/1.00 



0.07/0.25 1.00/1.00 



BioPro. 



0.86/0.92 0.89/0.95 0.94/0.98 



PRATT 



0.46/0.70 0.03/0.15 0.09/0.15 



EM 



0.83/0.87 



0.89/0.91 0.87/0.89 



32s 



18s 



2ml Is 



2s 



57s 



2s 



0.2s 



70% 



HIGEDA 



0.79/0.81 



0.84/0.89 0.65/0.76 



GEMFA 



0.56/0.65 



0.50/0.60 0.52/0.56 



GAME 



0.14/0.28 



0.19/0.34 0.20/0.34 



MEME 



0.44/0.50 



0.75/0.75 0.27/0.30 



GLAM2 



0.95/0.95 



0.01/0.05 0.00/0.05 



BioPro. 



0.26/0.33 



0.39/0.44 0.25/0.33 



PRATT 



0.31/0.40 0.05/0.10 0.19/0.30 



EM 



0.38/0.48 



0.47/0.58 0.48/0.54 



32s 



20s 



lm55s 



2s 



lm20s 



2s 



0.2s 
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In Table 4.31, we compared results with those obtained with seven other 
algorithms. When motif sequences are 91% identical, all algorithms performed equally 
well regardless of base composition. However, when the identity dropped to 79% or 
70%), HIGEDA in general performed as well or significantly better than other 
algorithms on all base compositions. Hence, HIGEDA is not significantly affected by 



noise. 



Finding motifs in biological DNA sequences 

Table 4.32 : Average performance (LPC/SPC/run time (seconds)) on eight DNA 
datasets (# of sequences/length of motif/# of motif occurrences). 



Algo. 


E2F(27/ll/25) 


ERE(25/13/25) 


crp(24/22/18) 


arcA(13/15/13) 


HIGEDA 


0.57/0.96/33s 


0.58/0.87/37s 


0.75/0.90/22s 


0.38/0.53/56s 


GEMFA 


0.64/0.85/39s 


0.74/0.92/43s 


0.57/0.88/12s 


0.32/0.42/26s 


GAME 


.24/.90/2m45s 


.24/.75/2mlls 


.45/.80/2m33s 


.05/.10/lm26s 


MEME 


0.71/0.85/16s 


0.68/0.68/4s 


0.55/0.68/ls 


0.47/0.54/4s 


GLAM2 


.84/.93/lm28s 


.74/.92/lml0s 


.54/.64/lml3s 


.47/.54/lml7s 


PRATT 


0.17/0.33/.2s 


0.24/0.44/.2s 


0.22/0.56/. Is 


0.17/0.23/.4s 


BioPros. 


0. 50/0.63/3s 


0.65/0.71/3s 


0.37/0.45/ls 


0.01/0.01/6s 


Algo. 


argR(17/18/17) 


purR(20/26/20) 


tyrR(17/22/17) 


ihf( 17/48/24) 


HIGEDA 


.75/.94/lm22s 


.90/.93/lm38s 


.34/.41/lml9s 


.14/.31/3m44s 


GEMFA 


0.35/0.38/30s 


0.81/0.90/39s 


0.32/0.49/27s 


0.11/0.26/lm 


GAME 


.36/.55/2m31 


.33/.53/3m23s 


.ll/.23/2m39s 


.07/.18/10m7s 


MEME 


0.75/0.94/5s 


0.58/0.85/4s 


0.43/0.47/3s 


0.00/0.00/5s 


GLAM2 


.38/.47/lm32s 


.17/.50/3mlls 


.35/.35/lm55s 


.12/.33/9m55s 


PRATT 


0.16/0.29/.2s 


0.46/0.95/.2s 


0.06/0. 12/.3s 


0.06/0.25/.5s 


BioPros. 


0.62/0.69/8s 


0.18/0.20/7s 


0.33/0.42/6s 


0.05/0.13/17s 
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We used eight DNA transcription factor binding site datasets, two eukaryotic 
datasets, ERE and E2F, and six bacterial datasets: CRP, Arc A, ArgR, PurR, TyrR and 
IHF. Table 4.32 shows that HIGEDA performed as well or better than other algorithms 
on most datasets. One exception is ERE, possibly because it is an example of OOPS 
which is the motif finding model used by GEMFA. Also, GLAM2 performed better 
than HIGEDA on IHF, which contains a motif of length 48. 

Finding motifs in biological protein sequences 

We selected twelve protein families from PFAM and Prosite representing motifs 
with different levels of sequence specificity, from completely defined to more subtle, 
degenerate and gapped. 



3- 



2- 



1 - 



OJ 




CN CO -* kD 



Figure 4-28 : Strep-H-triad motif by HIGEDA 
Results obtained with HIGEDA are compared to those obtained with other 
algorithms (Table 4.33). The gap parameters of all algorithms, if supported, are set to 
known gap structures. All algorithms identify the unambiguous DUF356 IHPPAH 
motif. For the Strep-H-triad motif, the HIDEGA, MEME and GLAM2 consensus 
sequences are more specific than the known motif, because of the motif decoding 
procedure. Full degeneracy is shown in Figure 4-28. 
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Table 4.33 : Detection of protein motifs (1-8, PFAM; 9-12, Prosite) 



Family (#seq, max L) 
Algorithm (run time) 



Known motif 

Predicted motif 



1. DUF356 (20,158) 
HIGEDA(12s) 
MEME (2s) 
GLAM2(lm58s) 
PRATT (0.1s) 



IHPPAH 
IHPPAH 
IHPPAH 
IHPPAH 
IHPPxH 



2. Strep-H-triad (21,486) 
HIGEDA (40s) 
MEME (5 s) 
GLAM2(lm51s) 
PRATT (0.1s) 



HxxHxH 

HGDHYH 

HGDHYH 

HGDHYH 

HxxHxH 



3. Excalibur (26,296) 
HIGEDA (lm) 
MEME (4s) 
GLAM2 (2m56s) 
PRATT (0.5s) 



D x D x DG xx CE 

D R D [RNGK] DG [IV] [AG]CE 

[WY] Q [GA] [NW] Y Y L K S D 

DRDKDGVACE 

D x D xxx C 



4. Nup-retrotrp (14,1475) 
HIGEDA (3ml 8s) 
MEME (10s) 
GLAM2 (6m55s) 
PRATT (Is) 



GRKIxxxxxRRKx 
SGRKIKTAVRRKK 

W[DE]C[DE][TV]C[LC][VL]QNK[AP][ED] 
SNGKNMFSSSGTS 
FSSS[GP1TxxSx(l,2)RK 



5. Flagellin-C (42, 1074) 
HIGEDA (lm29s) 
MEME (6s) 
GLAM2 (3m35s) 
PRATT (0.07s) 



NRFxSxIxxL 

RA [NDQG] L G A [FV] Q N R 

R [AS] [DNQ] L G A [VF] Q N R 

RADLGAFQNR 

A-x-Q 



6. Xin Repeat (25,3785) 
HIGEDA (5m45s) 
MEME (lml s) 
GLAM2 (6m36s) 
PRATT (6s) 



GDV[KQR] [TSG]x[RKT] WLFETxPLD 
GDV[RK] [ST] [ACT] [RK] WLFETQPLD 
KGDV [RK] T [CA][RK]W[LM]FETQPL 
HKGDVRTCRWLFETQP 
GDVxTxxWxFETxP 



7. Planc.Cyt. C (12,1131) 
HIGEDA (lm) 
MEME (9s) 
GLAM2 (3m42s) 
PRATT (0.02s) 



C{CPWHF} {CPWR}CH{CFYW} 

[NHKS Y] C [ AQELMFT V] [ AGS] CH 

FSPDGK 

RFSPDG 

PDxxxL 



8. zf-C2H2 (115,99) 
HIGEDA (2ml 2s) 



MEME (45 s) 
GLAM2 (19ml 7s) 
PRATT (0.1s) 



Cx(l-5)Cx3#x5#x2Hx(3-6)[HC] 

C[_Y]LRQEK][_C]LEKP][_EGIK]C[_G]K[ARST]F 

S[RQK] [KS] S [NHS]L[NKT] [RKST] H [QILKM]R[IST 

V]H 

EIC[NG]KGFQRDQNLQLHRRGHNLPW 

YKC_P_CGK_FS_KSSLT_H_RI_HT 

Hx(3-5)H 
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Table 4.33 (con't.) 



Family (#seq, max L) 
Algorithm (run time) 



Known motif 

Predicted motif 



9. EGF_2 (16,3235) 
HIGEDA(4ml9) 

MEME (15s) 
GLAM2 (8m) 
PRATT (Is) 



CxCx2 [GP] [FYW]x(4-8)C 

CLEK][_C][_EILSV]C[NDE][NDQEPS]G 

[FWY][AQESTY]G[DS]DCS[GI] 

GECxCNxGYxGSDCSI 

CNx(0-9)GECxCNEGWSGDDC 

Gx(0-l)CxCx5Gx2C 



10. LIG-Cyclin- 1(23,1684) 
HIGEDA (lm8s) 
MEME (18s) 
GLAM2 (2m35s) 
PRATT (0.5s) 



[RK]xLx(0- 1 )[FYLIVMP] 

[RK]R[RL] [RIL] [DEIFY] 

PAPAP 

Tx(0-1)RKP 

Kx(0-1)R 



ll.LIG_PCNA(13, 1616) 

HIGEDA (lm32s) 

MEME (9s) 

GLAM2 (4m59s) 

PRATT (0.3s) 



x(0-3)x{FHWY} [ILM] {P} {FHILVWYP} 

[DHFM][FMY] 

[RGIKP] LQK] [_RDKS] [DPST] [IL] 

[SV]FFG 

GQKTIMSFFS 

Qx(0-1)SIDSFF[K_][R_] 

Lx2Sx(2-4)K 



[KMTY] 



12. Mod_Tyr_Itam (10,317) 
HIGEDA (51s) 



MEME (2s) 
GLAM2 (3m43s) 
PRATT (0.1s) 



[DE]x(2)Yx(2)[LI]xx(6- 1 2)Yx(2)[LI] 

[RDQE] [_S] [_CET] [_IL] [_GM] [_QTY] LQELK] [_D 

GK] [EI] [_RS] [_RN] [_RGP]LQ[DGV] [GHT] YxM[CI 

Y]Q[NGT]L[ILS] 

GKEDDGLYEGLNIDDCATYEDIHM 



E H 



SLAQKSMDH _SRQ 



Lx2Lx(0-l)L 



HIGEDA and GLAM2 identified the exact motif for the moderately degenerate 
Excalibur (Section 2.5.6). Only HIGEDA identified Nup-retrotrp motif erring in only 
one residue and identifying the key arginine (R) residues. No algorithm identified the 
Flagellin-C motif, possibly because of the weak motif signal. While all performed well 
in finding the Xin repeat motif, only HIGEDA identified the Planctomycete 
Cytochrome C motif, erring in only one residue. For the gapped motif ZF-C2H2 family, 
only HIGEDA identified the key C, H, F and L residues and the spacing between them. 
HIGEDA also provided better results on EGF2 and LIGCYCLIN1. No algorithm 



191 



identified the LIG_PCNA and MODTYRITAM motifs. Together, these results show 
that HIGEDA can effectively find more subtle motifs, including those with gaps. 
Because HIGADA uses GA to find motif seeds, its run times are longer than those of 
MEME and PRATT, but shorter than those of GLAM2, as indicated in Table 4.33. 

4.3.2 New motif discovery using HIGEDA 

To test HIGEDA in prediction of novel motifs, we selected the following 
proteins from PFAM: those containing only a ZZ domain, only a SWIRM domain, a ZZ 
plus a Myb domain, and a ZZ plus both a Myb and a SWIRM domain. While ZZ 
domains have a consensus sequence in PFAM, SWIRM and MYB proteins are defined 
only by experimental demonstration of functional properties and protein sequence 
alignments. We ran HIGEDA on these groups varying the output motif length from 4 to 
33. We defined statistically significant motifs as those found in more than 85% of 
sequences, and having a p-value less than 0.0001 (computed using Touzet and Varre 
[82]. We also ran MEME, GLAM2 and PRATT. 

In Table 4.34, we first show that HIGEDA, MEME and PRATT successfully 
identified patterns of the known ZZ motif, CX(1-5)C. No motif was discovered in the 
SWIRM-containing proteins. But HIGEDA, MEME and GLAM2 each discovered a 
motif, WxAxEELLLL, common to Myb proteins, which we propose as a novel domain. 
Uniquely discovered by HIGEDA, we propose a second novel domain, 
GNW[AQ]DIADH[IV]G[NGST], of unknown function, which is specific to the ZZ- 
SWIRM-Myb protein set. 
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Table 4.34 : Motifs of ZZ, Myb and SWIRM domains by the four algorithms 



Algorithm 


Domain / Found motif 


p-value 


% Hit 


ZZ (67,5125) 


HIGEDA 


D[FY]DLC[AQES]xC-[EYV] 


5.76e-7 


91.67 




C[GHPY]D[FY]DLC[AQES]xC 


2.25e-6 


93.06 


MEME 


CLICPDYDLC 




73.00 


GLAM2 


HSRDHP[ML] [IL] [QRK] 




91.00 


PRATT 


Cx2C 




92.00 


SWIRM (49,624): No motif found 


ZZ-Myb (35,2697) 


HIGEDA 


[ND]W[ST]A[DE]EELLLL 


3.81e-9 


97.14 


MEME 


WTADEELLLLD 




100.0 


GLAM2 


WTAEEELLLLE 




97.00 


PRATT 


Ex2Lx(4-5)E 




96.00 


ZZ-Myb-SWIRM (43,1049) 


HIGEDA 


GNW[AQ]DIADH[IV]G[NGST] 


3.6e-10 


100.0 




WGADEELLLLEG 


2.4e-10 


100.0 


MEME 


WGADEELLLLEG 




100.0 


GLAM2 


KQLCNTLRILPK 




85.00 


PRATT 


Wx( 1 -2)D[ADEQ]Ex(2-3)L[ILV] 




100.0 



Lastly, we selected the protein families, Presenilin-1 and Signal peptide 
peptidase, that share the conserved sequence GXGD and are grouped in the same clan 
in PFAM, as shown in Table 4.35. 

Only HIGEDA successfully discovered motifs containing GLGD in these 
families. In addition, two patterns, YDIFWVF, which is also identified by MEME and 
GLAM2, and FGT[NDP]VMVTVA[KT], which is also identified by MEME, appear 
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frequently (in 86% of 151 protein sequences) in the Signal peptide peptidase family. 
We, therefore, propose these patterns as new motifs of this family. 



Table 4.35 : Motifs of Presenilin-1 and Signal peptide peptidase by HIGEDA 



HIGEDA 



Algorithm 


Family (# seq, max) / Found motif 


p-value 


% Hit 


Presenilin-1 (86,622) 


HIGEDA 


GLGDFIFYS 




1.3e-10 


85.00 




KLGLGDFIFY 




1.6e-10 


85.00 


MEME 


HWKGPLRLQQ 






66.00 


GLAM2 


GDFIFYSLVL 






85.00 


PRATT 


Lx(l-3)Lx(2-3)I 






100.0 




Signal p. Peptidase (151, 


,690) 







F[AS]MLGLGDIVIPG 


8.9e-08 


88.74 


YDIFWVF[GF]T[NDP] 


1.7e-08 


86.75 


GLF[IFV]YDIFWVF 


8.7e-08 


86.75 



FGT[NDP]VMVTVA[KT] 



8.6e-08 



86.09 



FWVFGT[ND]VMV 
MEME YDIFWVFGT[NDP]VMV 

[AS]MLGLGDIVIPGI 



86.00 
87.00 
92.00 



GLAM2 



[FL][FI]YD[IV]F[WF]VF[GF] 
GL FFYDIFWVFGT 



95.00 
90.00 



PRATT 



Lx3F 



100.0 



HIGEDA integrates HGA, EM and dynamic programming (DP) to find motifs in 
biological sequences. HIGEDA uses HGA to manage different motif seeds and to 
escape local optima more efficiently than conventional GAs by using gene-set levels. 
By applying DP and proposing a new technique for aligning sequences to a motif 
model, and then using EM with the dynamic alignments, the HIGEDA creates a new 
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way to automatically insert gaps in motif models. Using the gradient descent learning 
law, HIGEDA effectively estimates its model parameters without the greedy searches 
used in MEME. Using the pseudo-counts method based on a simple mechanism and the 
BLOSUM62 substitution probability matrix, HIGEDA avoids the small dataset problem 
of zeros in motif model elements, without using the Dirichlet mixture prior knowledge 
used in MEME and similar approaches. Because HIGEDA uses a set of motif seeds 
generated randomly, it outperforms MEME, GLAM2 and several other algorithms in 
finding subtle, more degenerate and gapped motifs. Lastly, we have shown that, as a 
TCM based algorithm, HIGEDA can identify novel motifs. 
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5. Applications 

5.1 Recovery of a gene expression signature 

In this application, we apply our methods to recover the estrogen gene expression 
signature from the CMAP dataset. The flowchart is shown in Figure 5-1. 



Integrated gene 
expression data 



Cluster solution 
Competitors 




fzGASCE : Genetic 

algorithm with fSC 

and fzBLE 



Cluster candidates 
Fuzzy partition (1) 




Done 



fzSC : Fuzzy 
Subtractive Clustering 



Potential offsprings 



Immunized 
candidates 



Generation 
selection 



Figure 5-1 : Gene expression signature prediction 

The fzGASCE with fzSC, fzBLE, fzPBD and RPPR methods were applied to 

the estrogen treatment datasets in CMAP with the estrogen gene expression signature 

[85]. We first preprocessed the CMAP arrays using the RMA method to obtain 564 

expression profiles. Using our meta-analysis method, RPPR [169], we selected the 
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expression profiles that were most strongly related to the estrogen expression signature 
[25, 26]. Expression profiles from the five experiments, #373, 988, 1021, 1079 and 
1113 were selected. We then used the Rank-Product method [30, 54] across multiple 
experiments to filter probes differentially expressed using a p-value cutoff at 0.005. 
This process selected 204 probes. 

The expression levels of these 204 probes across the selected profiles were 
extracted and the fzGASCE method clustering was applied using Pearson Correlation 
distance. Four clusters were found. Each cluster, a set of probesets, is considered a 
candidate estrogen gene expression signature. We applied RPPR to each cluster using 
CMAP. Significantly related expression profiles, with cutoffs of 0.5 and -0.5 for 
positive and negative scores respectively, for each cluster are shown in Table 5.1. 

Table 5.1 : Clustering results with estrogen significantly associated drugs 



Cluster 
(#probesets) 



Associated drugs [CMAP experiments] (+/-hitscore) 



1 (60) estradiol[373](+0.93), estradiol[1021](+0.85), estradiol[988] (+0.733), 

estradiol[1079](+0.69), doxycycline[l 1 13](+0.64), 

genistein[1073](+0.63), estradiol[365](+0.63), nordihydroguaiaretic 
acid[415](+0.59), alpha-estradiol[1048](+0.58), genistein[268](+0.57), 
estradiol[414](+0.57), genistein[1015](+0.56), dexamethasone[374](+0.54), 
estradiol[121](+0.53), alpha-estradiol[990](+0.52), genistein[267](+0.52), 
fulvestrant[1076](-0.75), fulvestrant[1043](-0.71), fulvestrant[985](-0.66), 
fulvestrant[367](-0.57), 5224221 [956](-0.55), 15-delta prostaglandin 
J2[1011](-0.54), 5230742[862](-0.54), fulvestrant[310](-0.52) 



2 (53) doxycycline[l 1 13](+0.85), estradiol[988](+0.74), estradiol[1021](0.61), 

estradiol[373](0.58), 12,13-EODE[1108](+0.56) 



3(44) estradiol[988](+0.84), estradiol[1079](+0.80), estradiol[1021](+0.58), 

alpha-estradiol[990](+0.52) 



4 (47) estradiol[1021](+0.92), estradiol[988](+0.66), doxycycline[1113](+0.61), 

estradiol[1079](+0.58) 
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Table 5.2 : Estrogen signature and the 'signature' predicted by FZGASCE 



Affy-id 


Selected 


expression profiles 


predicted 


373 


988 


1021 


1079 


1113 


202437_s_at 


1.72 


1.92 


1.97 


1.68 


1.75 


X 


206115_at 


1.29 


1.29 


1.28 


1.40 


1.54 


X 


209339_at 


1.66 


1.80 


1.77 


1.76 


1.82 


X 


209687_at 


1.51 


1.66 


1.59 


1.63 


1.65 


X 


210367_s_at 


1.62 


1.77 


1.64 


1.70 


1.71 


X 


210735_s_at 


1.71 


1.73 


1.71 


1.81 


1.81 


X 


211421_s_at 


1.35 


1.55 


1.47 


1.63 


1.59 


X 


213906_at 


1.29 


1.48 


1.45 


1.43 


1.62 


X 


215771_x_at 


1.45 


1.69 


1.71 


1.79 


1.63 


X 


215867_x_at 


1.69 


1.82 


1.77 


1.70 


1.82 


X 


205879_x_at 


1.50 


1.81 


1.68 


1.68 


1.95 


X 


205862_at 


1.69 


1.81 


1.69 


1.69 


1.72 


X 


203666_at 


1.55 


1.61 


1.60 


1.60 


1.68 


X 


203963_at 


1.71 


1.88 


1.71 


1.85 


1.81 


X 


204508_s_at 


1.60 


1.82 


1.76 


1.93 


1.91 


X 


204595_s_at 


1.78 


1.80 


1.62 


1.75 


1.99 


X 


205380_at 


1.31 


1.59 


1.50 


1.64 


1.84 


X 


205590_at 


1.69 


1.71 


1.87 


1.82 


1.79 


X 


217764_s_at 


1.62 


1.83 


1.74 


1.85 


1.84 


X 


204597_x_at 


1.70 


1.74 


1.75 


2.02 


1.69 


- 


205239_at 


1.71 


1.73 


1.77 


1.62 


1.81 


- 


204497_at 


1.79 


1.76 


1.70 


1.79 


1.84 


- 



Table 5.1 shows that cluster 1 strongly hit the estrogen expression profiles in 
CMAP. It is also the only cluster with a significant negative correlation with fulvestrant, 
an estrogen antagonist. This cluster is therefore the best candidate for the estrogen gene 
expression signature. Interestingly, this cluster contains 19 of 22 probesets from the 
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actual estrogen signature that were significantly regulated in the five selected profiles 
(Table 5.2). We therefore conclude that fzGASCE successfully discovered the estrogen 
gene expression signature from the CMAP datasets. 

5.2 Gene-drug association prediction 

Our algorithm methods can be combined in a solution model for the problem of 
gene-drug association prediction. Given a set of genes that may describe completely or 
partially the biological phenotype of interest, the problem is to search the CMAP 
datasets for drug treatments that directly or indirectly alter their expression. 

5.2.1 Model design 

Let A be the given set of genes. 






Gene set A 



Expanded set B 



- 



Key gene set B^ 



Gene expression 
signature generation 



r 



Predicted gene-drug 
associations 






Figure 5-2 : Gene-Drag association prediction 



199 



Gene set expansion 



Given A, a set of 

genes describe the 
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Apply KEGG pathway 
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r Select the genes having at 
least one interaction 
partner that is annotated in 
KEGG 



1 



Selected gene set: B 
AcB 





\^J Genes of the given set A 

KEGG-pathway annotated 
\_y Expanded set: B 



Figure 5-3 : Expansion of gene set A 

Because our knowledge of the molecular basis of the phenotype is limited, the set 
A probably does not describe the phenotype completely. To address this problem, we 
propose to expand the gene set using Protein-Protein Interaction (PPI) and pathway 
data. A gene g is added to set A if g interacts with at least one gene in set A, and g 
and/or its interaction partner in A is annotated as a component of some pathway. These 
criteria are chosen because, while physically interacting proteins are not necessarily 
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components of a biological process, a pair of interacting proteins which are associated 
with one pathway is more likely to participate in another pathway. 

We first use PPI data from the Human Protein Reference Database (HPRD) to 
expand set A. We then use KEGG pathway data to select proteins that meet the criteria. 
If a gene g, geA, is annotated as a component of any pathway, we will include genes 
that interact with g, whether or not they are annotated as pathway components. If gene g 
is not a component of any pathway, its interaction partners will be included only if they 
are annotated as components of at least one pathway. These included genes then suggest 
a novel potential role for g. 

Let B be the expanded gene set. B contains the set A and the proteins that interact 
with A, and may mediate the association of proteins in A with some pathways. Every 
pair of genes in B is annotated in at least one pathway. For example, consider a set of 
three genes, a, b, c, and their interaction partners, d, e, f (Figure 5-3). The interaction 
between the genes f and b is not annotated in any KEGG pathway. Therefore, f is not 
selected. The expanded set B then contains the genes a, b, c, d and e. 

The key genes 

Set B may be large, and it is possible that not all of its members are involved in a 
single pathway relevant to the biological phenotype. The next step, therefore, is to 
search for the genes that are most likely to interact with genes in A (Figure 5-4). We 
call these the key genes associated with the phenotype. Key genes may be members of 
A or the genes that interact with members of A. 
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r 



Annotate B using 
GO CC terms 



Create M B , a CC-terms based 
distance matrix for genes in B 

1 
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ofB:Bi,B 2 ,...,B k 



Select B A from {B i? i=l,k} 

where B A is the closest to A 

using GO BP based similarity 



Radius BA using PPI: BA , create 
the expression profiles SA for BA 
using random walk and heuristics 



Figure 5-4 : Identification of key genes 

We cluster the genes in B into groups where genes of the same group most often 
occupy the same cellular location to facilitate the interaction. We use the GO cellular 
component (CC) annotations to measure the distances between genes in B and then 
apply our clustering methods to classify genes in B into groups, Bi, B 2 ,..., Bk. These 
subsets may be the set A itself or have a member that interacts with at least one member 
of A and is located within the same cellular locations as that member of A. These sets 
Bi, i=l,. . .,k, are not necessarily related to A in a biological process, we therefore search 
for B^, where B^ e{Bi}i=i..k, and B^ is functionally closest to A. We address this 
problem using a semantic similarity measure based on GO Biological Process (BP) 
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annotations. The selected set B^ should have the highest semantic similarity to A. We 
define B A , B A = A u B^ , the key genes of the expanded set B of set A. Since the set B A 
was generated by examining known interactors of A that occupy similar cellular 
locations and are annotated to pathways, Ba is believed to better describe the biological 
phenotype of interest than the set A. 

Generate and apply gene expression signature 

In this step, we use B A and the CMAP gene expression microarray data to 
generate a gene expression signature for the biological phenotype. Because not all the 
genes in B A exhibit a change in all expression profiles in the dataset, the key issues in 
this step are to select expression profiles and determine the direction change in 
expression of genes in B A . We extract the fold change (FC) levels of genes in B A using 
all gene expression profiles in CMAP, compute p-values from FC to filter genes 
differentially expressed in each expression profile using Rank Product [30], and conduct 
cluster analysis to group all CMAP expression profiles over genes in B A into p groups, 
{Pi}i=i..p- We choose P A as the group of samples where the genes in B A are the most 
changed. We use the selected genes to define the gene expression signature, S A , for the 
phenotype. The FC levels of genes in S A are averaged across the expression profiles in 
P A to determine the expression direction. S A is the gene expression signature we suggest 
for the phenotype. 
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Figure 5-5 : Gene expression signature generation and application 

The predicted expression signature, Sa, is then used to identify a subset of 
CMAP, CMAPa which contains the expression profiles strongly relevant to the 
phenotype. The small-molecule/drugs used as treatments in CMAP A suggest potential 
therapeutics to modulate expression of the given gene set. By conducting cluster 
analysis on CMAP A and using the GO BP-based semantic similarity measure, we can 
also obtain the set Ca that contains additional genes potentially related to the biological 



204 



phenotype of interest, and the set C c a of additional groups of genes changed, but 
differently that did the genes in the given set A. 

5.2.2 Model testing 

Intellectual disability (ID) in Down syndrome (DS) ranges from low normal to 
severely impaired, and has a significant impact on the quality of life of the individuals 
affected and their families. Development of pharmacotherapies for cognitive deficits is, 
therefore, an important goal. The working hypothesis in DS research is that the 
phenotypic features are caused by the modest 50% increases in expression of trisomic 
genes and that these in turn result in modest perturbations of otherwise normal cell 
functions and pathways. However, the primary difficulties of targeting therapeutics to 
Hsa21q genes are the determinations of the number of genes to be regulated 
simultaneously, the number of drugs to be used, and the gene and/or drug interactions 
might occur in full trisomies to produce adverse effects [171]. This application will help 
with identification of the set of drug treatments that may affect a given set of Hsa21q 
genes using the CMAP datasets. 

Hsa21 gene set 1 

This gene set contains six Hsa21q genes: APP, DYRK1A, ITSN1, RCAN1, 
S100B and TIAM1 [171] as shown in Table 5.3. 

1) As in Figure 5-3, to generate the expanded gene set, we used the PPI and 
pathway database in the Gardiner's lab [167] which is compiled from 
multiple bioinformatics resources [168], (available online at 
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http://gfuncpathdb.ucdenver.edu/iddrc/). We used the HPRD PPIs at "in- 
vivo" level and the KEGG pathways to filter the genes. The expanded set B 
of A is shown in Table 5.4. 

Table 5.3 : Hsa21 set-1 query genes 



No. 


Gene 


Description 


Chr. 


1 


APP 


amyloid beta (A4) precursor protein 


21 


2 


DYRK1A 


dual-specificity tyrosine-(Y)-phosphorylation regulated kinase 1 A 


21 


3 


ITSN1 


intersectin 1 (SH3 domain protein) 


21 


4 


RCAN1 


regulator of calcineurin 1 


21 


5 


S100B 


SI 00 calcium binding protein B 


21 


6 


TIAM1 


T-cell lymphoma invasion and metastasis 1 


21 



2) As in Figure 5-4, we conduct cluster analysis on B using fzGASCE and 
GOfzBLE methods with GO-CC annotations. Gene groups are then ranked 
based on how close they are to the given gene set A using GO-BP based 
semantic similarity measurement. Results are shown in Table 5.5. We 
choose the first group, which has the highest semantic similarity score to the 
set A, for the next steps. 
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Table 5.4 : Expanded set of the Hsa21 set-1 

Selected proteins in BOLD are annotated in at least one pathway 



# 


Selected protein 


Interactor 


1 


APP 


SERPINA3, ABL1, APBA1, APBA2, APBB1, APBB2, 
CALR, CASP3, CDK5, CHRNA7, CLU, CTSD, F12, GRB2, 
GSK3B, GSN, HSD17B10, HMOX2, KNG1, KLC1, LRP1, 
NF1, PIN1, PPID, PSEN2, SHC1, TGFB1, TGFB2, NUMB, 
NAE1, NUMBL, ITM2B, HOMER3, HOMER2, MAPK8IP1, 
APBA3, BCAP31, APBB3, KAT5, NCSTN, BACE2, 
LDLRAP1, HTRA2, LRP1B, SHC3, NECAB3, COL25A1 


2 


DYRK1A 


CREB1, FOXOl, GLI1, YWHAB, YWHAE, YWHAG 


3 


ITSN1 


DNM1, EPHB2, EPS15, HRAS, SNAP25, SOS1, SNAP23, 
WASL 


4 


RCAN1 


MAP3K3, PPP3CA 


5 


S100B 


GUCY2D, MAPT, IQGAP1 


6 


TIAM1 


ANK1, CAMK2G, CD44, EFNB1, EPHA2, HRAS, MYC, 
NME1, PRKCA, RAC1, SRC, TIAM1, YWHAG, 
MAPK8IP1, MAPK8IP2, PARD3, PPP1R9B 



Table 5.5 : Gene clusters of the Hsa21 gene set-1 expanded set 
Genes in BOLD are in the original gene set A 



# 


Genes 


Similarity 
score to A 


1 


APBA1(9), APBA2(15), APBBl(ll), APBB2(4), APP(21), CHRNA7(15), CLU(8), 
CREB1(2), CTSD(ll), DNM1(9), RCAN1(21), DYRK1A(21), EPHA2(1), 
EPHB2(1), KNG1(3), LRP1(12), MYC(8), NF1(17), PIN1(19), PPID(4), PSEN2(1), 
S100B(21), SNAP25(20), TGFB1(19), TGFB2(1), SNAP23(15), IQGAP1(15), 
NUMBL(19), HOMER3(19), KAT5(11), LRP1B(2), PPP1R9B(17) 


0.32 


2 


SERPINA3(14), ABL1(9), ANK1(8), CALR(19), CAMK2G(10), CASP3(4), 
CD44(11), CDK5(7), EFNBl(X), EPS15(1), F12(5), FOX01(13), GLI1(12), 
GRB2(17), GSK3B(3), GSN(9), GUCY2D(17), HSD17B10(X), HMOX2(16), 
HRAS(ll), KLC1(14), MAPT(17), MAP3K3(17), NME1(17), PPP3CA(4), 
PRKCA(17), RAC1(7), ITSN1(21), SHC1(1), SOSl(2), SRC(20), TIAM1(21), 
YWHAB(20), YWHAE(17), YWHAG(7), NUMB(14), NAE1(16), WASL(7), 
ITM2B(13), HOMER2(15), MAPK8IP1(11), APBA3(19), BCAP31(X), APBB3(5), 
NCSTN(l), MAPK8IP2(22), BACE2(21), LDLRAPl(l), HTRA2(2), SHC3(9), 
PARD3(10), NECAB3(20), C0L25A1(4) 


0.26 
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3) As in Figure 5-5, we search for the probesets of the selected genes and 
extract their expression levels. Differential expression significance is set 
using a cutoff p-value at 0.005. We conduct cluster analysis of expression 
profiles using our cluster method, fzGASCE where the fuzzifier factor, m, is 
set to 1.15. The expression profile clusters are ranked based on the average 
fold change of genes that are differentially expressed. The cluster with 
highest expressed level is selected. 

4) We average the expression levels of the selected probesets (Sa) across 
selected profiles and assign the 'Down' label to probesets where the 
averaged FC value is less than 1 and vice versa. Results are shown in Table 
5.6. We propose these genes with their direction of expression change as the 
expression signature for the phenotype most related to the given set A of 
Hsa21 genes. 

5) We use the gene expression signature in Table 5.6 with RPPR [169] (support 
tool is available online [168]) to select the list of drug treatments in CMAP 
using a score cutoff at 0.8. Results are shown in Tables 5.7 and 5.8. Our 
approach successfully identifies drugs commonly used in treatment of 
mental illness. Mebhydrolin is used to treat depression in older ADULT with 
ID; cytochalasin B causes filament shortening, however, Krucker et al. [186] 
showed that the effect of cytochalasin B can be reversible making stable 
long-term potentiation in region CA1 of the hippocampus. Geldanamycin 
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can help with reducing brain injury; Levomepromazine is used to relieve 
symptoms of schizophrenia. All of these drugs are listed in Table 5.8. 

Table 5.6 : Proposed gene expression signature for the Hsa21 gene set-1 









(67 g 


enes, 147 pr 


obesets) 








No 


Probeset 


Sign 


Gene 


Chr. 


No 


Probeset 


Sign 


Gene 


Chr. 


1 


202123 s at 


i 


ABL1 


9 


75 


206689 x at 


i 


KAT5 


11 


2 


207087 x at 


i 


ANK1 


8 


76 


209192 x at 


i 


KAT5 


11 


3 


208352 x at 


i 


ANK1 


8 


77 


214258 x at 


i 


KAT5 


11 


4 


205389_s_at 


i 


ANK1 


8 


78 


212877_at 


i 


KLC1 


14 


5 


208353 x at 


i 


ANK1 


8 


79 


212878 s at 


i 


KLC1 


14 


6 


205391 x at 


i 


ANK1 


8 


80 


213656 s at 


i 


KLC1 


14 


7 


205390 s at 


i 


ANK1 


8 


81 


217512 at 


i 


KNG1 


3 


8 


206679 at 


i 


APBA1 


9 


82 


206054 at 


i 


KNG1 


3 


9 


215148 s at 


i 


APBA3 


19 


83 


57082 at 


i 


LDLRAP1 


1 


10 


205146 x at 


i 


APBA3 


19 


84 


221790 s at 


i 


LDLRAP1 


1 


11 


202652 at 


i 


APBB1 


11 


85 


203514 at 


i 


MAP3K3 


17 


12 


204650 s at 


i 


APBB3 


5 


86 


213013 at 


i 


MAPK8IP1 


11 


13 


211277 x at 


i 


APP 


21 


87 


213014 at 


i 


MAPK8IP1 


11 


14 


200602 at 


i 


APP 


21 


88 


205050 s at 


i 


MAPK8IP2 


22 


15 


214953 s at 


i 


APP 


21 


89 


208603 s at 


i 


MAPK8IP2 


22 


16 


217867 x at 


i 


BACE2 


21 


90 
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MAPT 


17 


17 


212952 at 


i 


CALR 


19 


91 


203928 x at 


i 


MAPT 


17 


18 


212953 x at 


i 


CALR 


19 


92 


203929 s at 


i 


MAPT 


17 


19 


214315 x at 


i 


CALR 


19 


93 


203930 s at 


i 


MAPT 


17 


20 


214316 x at 


i 


CALR 


19 


94 


20243 l_s_at 


i 


MYC 


8 


21 


200935 at 


i 


CALR 


19 


95 


202268 s at 


i 


NAE1 


16 


22 


212669 at 


i 


CAMK2G 


10 


96 


208759 at 


i 


NCSTN 


1 


23 


214322 at 


i 


CAMK2G 


10 


97 


210720 s at 


i 


NECAB3 


20 


24 


212757 s at 


i 
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10 


98 
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NUMB 


14 


25 


202763 at 


i 


CASP3 


4 


99 
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i 


NUMB 


14 


26 


204490 s at 


i 
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11 
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221280_s at 


i 


PARD3 


10 


27 


212014_x_at 


i 


CD44 


11 
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210094_s_at 


i 
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10 


28 


209835 x at 
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CD44 


11 
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221526 x at 


i 


PARD3 


10 


29 


210916 s at 


i 


CD44 


11 


103 


221527 s at 


i 


PARD3 


10 


30 


212063 at 


i 


CD44 


11 


104 


202927_at 


i 


PIN1 


19 


31 


217523 at 


i 


CD44 


11 


105 


204186 s at 


i 


PPID 


4 


32 


216056 at 


i 


CD44 


11 


106 


204185 x at 


i 


PPID 


4 


33 


204489 s at 


i 


CD44 


11 


107 


202429 s at 


i 


PPP3CA 


4 


34 


204247 s at 


i 


CDK5 


7 


108 


202457 s at 


i 


PPP3CA 


4 


35 


210123 s at 


i 


CHRNA7 


15 


109 


202425 x at 


i 


PPP3CA 


4 


36 


208791 at 


i 


CLU 


8 


110 


215194 at 


i 


PRKCA 


17 


37 


208792 s at 


i 


CLU 


8 


111 


213093 at 


i 


PRKCA 


17 


38 


222043 at 


i 


CLU 


8 


112 


215195 at 


i 


PRKCA 


17 


39 


214513 s at 


i 


CREB1 


2 


113 


206923 at 


i 


PRKCA 


17 


40 


204312 x at 


i 


CREB1 


2 


114 


211373_s_at 


i 


PSEN2 


1 


41 


204314 s at 


i 


CREB1 


2 


115 


204262 s at 


i 


PSEN2 


1 


42 


204313 s at 


i 


CREB1 


2 


116 


20426 lsat 


i 


PSEN2 


1 


43 


200766 at 


i 


CTSD 


11 


117 


208641 s at 


i 


RAC1 


7 


44 


209033 s at 


i 


DYRK1A 


21 


118 


209686 at 


i 


S100B 


21 
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Table 5.6 (con't.) 



No 


Probeset 


Sign 


Gene 


Chr. 


No 


Probeset 


Sign 


Gene 


Chr. 


45 


211541 s at 


i 


DYRK1A 


21 


119 


202376 at 


i 


SERPINA3 


14 


46 


211079 s at 


i 


DYRK1A 


21 


120 


214853 s at 


i 


SHC1 


1 


47 


203499 at 


i 


EPHA2 


1 


121 


201469 s at 


i 


SHC1 


1 


48 


217887 s at 


i 


EPS 15 


1 


122 


217048 at 


i 


SHC1 


1 


49 


217886 at 


i 


EPS 15 


1 


123 


206330 s at 


i 


SHC3 


9 


50 


215961 at 


i 


F12 


5 


124 


209131 s at 


i 


SNAP23 


15 


51 


205774 at 


i 


F12 


5 


125 


209130 at 


i 


SNAP23 


15 


52 


202724 s at 


i 


FOXOl 


13 


126 


214544 s at 


i 


SNAP23 


15 


53 


202723 s at 


i 


FOXOl 


13 


127 


202508 s at 


i 


SNAP25 


20 


54 


206646 at 


i 


GLI1 


12 


128 


202507 s at 


i 


SNAP25 


20 


55 


215075 s at 


i 


GRB2 


17 


129 


212777 at 


i 


SOS1 


2 


56 


209945 s at 


i 


GSK3B 


3 


130 


212780 at 


i 


SOS1 


2 


57 


200696 s at 


i 


GSN 


9 


131 


221284 s at 


i 


SRC 


20 


58 


214040 s at 


i 


GSN 


9 


132 


221281 at 


i 


SRC 


20 


59 


218120 s at 


i 


HMOX2 


16 


133 


213324 at 


i 


SRC 


20 


60 


218121 at 


i 


HMOX2 


16 


134 


203085 s at 


i 


TGFB1 


19 


61 


217080 s at 


i 


HOMER2 


15 


135 


203084 at 


i 


TGFB1 


19 


62 


212983 at 


i 


HRAS 


11 


136 


209908 s at 


i 


TGFB2 


1 


63 


203089 s at 


i 


HTRA2 


2 


137 


220407 s at 


i 


TGFB2 


1 


64 


211152 s at 


i 


HTRA2 


2 


138 


209909 s at 


i 


TGFB2 


1 


65 


213446 s at 


i 


IQGAP1 


15 


139 


206409 at 


i 


TIAM1 


21 


66 


210840 s at 


i 


IQGAP1 


15 


140 


213135 at 


i 


TIAM1 


21 


67 


200791 s at 


i 


IQGAP1 


15 


141 


205810 s at 


i 


WASL 


7 


68 


217731 s at 


i 


ITM2B 


13 


142 


205809 s at 


i 


WASL 


7 


69 


217732 s at 


i 


ITM2B 


13 


143 


217717 s at 


i 


YWHAB 


20 


70 


209297 at 


i 


ITSN1 


21 


144 


208743 s at 


i 


YWHAB 


20 


71 


35776 at 


i 


ITSN1 


21 


145 


210317 s at 


i 


YWHAE 


17 


72 


207322 at 


i 


ITSN1 


21 


146 


210996 s at 


i 


YWHAE 


17 


73 


210713 at 


i 


ITSN1 


21 


147 


213655 at 


i 


YWHAE 


17 


74 


209298 s at 


i 


ITSN1 


21 













Table 5.7 : Drugs predicted to enhance expression of Hsa21 gene set-1 



Drug (concentration) 


Target or disease 


Estradiol (lOnM) 


Treatment of urogenital symptoms associated with post- 
menopausal atrophy of the vagina and/or the lower urinary 
tract. 


4,5-dianilinophthalimide (lOmM) 


Potent inhibitors of amyloid P fibrillization. 


Colforsin (50mM) 


Novel drug for the treatment of acute heart failure. 


Fisetin (50mM) 


A compound that has been shown to be able to alleviate 
ageing effects in certain model organisms. 
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Table 5.8 : Drugs predicted to repress expression of Hsa21 gene set-1 



Drug (concentration) 


Target or disease 


Mebhydrolin (4.8mM) 


Enhanced sedation with alcohol or other CNS depressants. 
Additive antimuscarinic effects with MAOIs, atropine, 
TCAs which are the treatment of major depressive 
disorder, panic disorder and other anxiety disorders, 
causing increased levels of norepinephrine and serotonin 
that act as communication agents between different brain 
cells. 


Cytochalasin B (20.8mM) 


A cell-permeable mycotoxin. It inhibits cytoplasmic 
division by blocking the formation of contractile 
microfilaments. It inhibits cell movement and induces 
nuclear extrusion. 


Geldanamycin (l.mM) 


A benzoquinone ansamycin antibiotic that binds to Hsp90 
and inhibits its function. HSP90 proteins play important 
roles in the regulation of the cell cycle, cell growth, cell 
survival, apoptosis, angiogenesis and oncogenesis. 


Tanespimycin (l.mM) 


It helps cause the breakdown of certain proteins in the cell, 
and may kill cancer cells. It is a type of antineoplastic 
antibiotic and a type of HSP90 inhibitor. 


Vorinostat (lO.mM) 


A drug that is used to treat cutaneous T-cell lymphoma that 
does not get better, gets worse, or comes back during or 
after treatment with other drugs. It is also being studied in 
the treatment of other types of cancer. Vorinostat is a type 
of histone deacetylase inhibitor. 


Trichostatin A (l.mM , lOO.nM) 


An organic compound that serves as an antifungal 
antibiotic and selectively inhibits the class I and II 
mammalian histone deacetylase families of enzymes, but 
not class III HDACs. 
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Table 5.8 (con't.) 



Drug (concentration) 


Target or disease 


Levomepromazine (9.mM) 


Levomepromazine is used in palliative care to help ease 
distressing symptoms such as pain, restlessness, anxiety 
and being sick. It works on chemical substances acting on 
the nervous system in brain. Levomepromazine is also 
used to relieve symptoms of schizophrenia. Schizophrenia 
is a mental health condition that causes disordered ideas, 
beliefs and experiences. Symptoms of schizophrenia 
include hearing, seeing, or sensing things that are not real, 
having mistaken beliefs, and feeling unusually suspicious. 



Hsa21 gene set 2 



Given a set of Hsa21 genes: APP, DSCAM, ITSN1, PCP4, RCAN1, TIAM1. 



Table 5.9 : Hsa21 set-2 query genes 



No. 


Gene 


Description 


Chr. 


1 


APP 


amyloid beta (A4) precursor protein 


21 


2 


DSCAM 


Down syndrome cell adhesion molecule 


21 


3 


ITSN1 


intersectin 1 (SH3 domain protein) 


21 


4 


PCP4 


Purkinje cell protein 4 


21 


5 


RCAN1 


regulator of calcineurin 1 


21 


6 


TIAM1 


T-cell lymphoma invasion and metastasis 1 


21 



Table 5.10 shows the predicted gene expression signature for Hsa21 gene set 2. 
We apply the gene expression signature to search for the drug treatments that may affect 
genes in the given gene set using CMAP datasets. A list of drug treatments that 
positively regulates genes in the given set is shown in Table 5.1 1. Table 5.12 lists drugs 
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predicted to negatively affect these genes. These include geldanamycin which can help 
with reducing brain injury, podophyllotoxin which is used in the treatment of brain 
tumors, and chlorpromazine which is used in the treatment of various psychiatric 
illnesses. 

Table 5.10 : Predicted gene expression signature for the Hsa21 gene set-2 

(70 genes, 157 probesets) 



No. 


AffylD 


Sign 


Gene 


Chr. 


No. 


AffylD 


Sign 


Gene 


Chr. 


1 


202123 s at 


i 


ABL1 


9 


80 


210713 at 


i 


ITSN1 


21 


2 


205390 s at 


i 


ANK1 


8 


81 


209192_x_at 


i 


KAT5 


11 


3 


208352 x at 


i 


ANK1 


8 


82 


214258 x at 


i 


KAT5 


11 


4 


207087_x_at 


i 


ANK1 


8 


83 


206689 x at 


i 


KAT5 


11 


5 


205389 s at 


i 


ANK1 


8 


84 


212877_at 


i 


KLC1 


14 


6 


20539 lxat 


i 


ANK1 


8 


85 


212878_s_at 


i 


KLC1 


14 


7 


208353 x at 


i 


ANK1 


8 


86 


213656 s at 


i 


KLC1 


14 


8 


206679_at 


i 


APBA1 


9 


87 


206054 at 


i 


KNG1 


3 


9 


209870 s at 


i 


APBA2 


15 


88 


217512 at 


i 


KNG1 


3 


10 


209871 s at 


i 


APBA2 


15 


89 


57082 at 


i 


LDLRAP1 


1 


11 


215148 s at 


i 


APBA3 


19 


90 


221790 s at 


i 


LDLRAP1 


1 


12 


205146 x at 


i 


APBA3 


19 


91 


200784 s at 


i 


LRP1 


12 


13 


202652 at 


i 


APBB1 


11 


92 


200785 s at 


i 


LRP1 


12 


14 


212972 x at 


i 


APBB2 


4 


93 


203514 at 


i 


MAP3K3 


17 


15 


213419 at 


i 


APBB2 


4 


94 


213013 at 


i 


MAPK8IP1 


11 


16 


216750 at 


i 


APBB2 


4 


95 


213014_at 


i 


MAPK8IP1 


11 


17 


216747 at 


i 


APBB2 


4 


96 


208603_s_at 


i 


MAPK8IP2 


22 


18 


40148 at 


i 


APBB2 


4 


97 


205050 s at 


i 


MAPK8IP2 


22 


19 


212970 at 


i 


APBB2 


4 


98 


20243 l_s_at 


i 


MYC 


8 


20 


212985 at 


i 


APBB2 


4 


99 


202268 s at 


i 


NAE1 


16 


21 


204650 s at 


i 


APBB3 


5 


100 


208759 at 


i 


NCSTN 


1 


22 


200602 at 


i 


APP 


21 


101 


210720 s at 


i 


NECAB3 


20 


23 


214953 s at 


i 


APP 


21 


102 


211914 x at 


i 


NF1 


17 


24 


211277 x at 


i 


APP 


21 


103 


212676 at 


i 


NF1 


17 


25 


217867 x at 


i 


BACE2 


21 


104 


216115 at 


i 


NF1 


17 


26 


200837 at 


i 


BCAP31 


X 


105 


204325 s at 


i 


NF1 


17 


27 


200935 at 


i 


CALR 


19 


106 


211094 s at 


i 


NF1 


17 


28 


212952 at 


i 


CALR 


19 


107 


212678 at 


i 


NF1 


17 


29 


214315 x at 


i 


CALR 


19 


108 


210631 at 


i 


NF1 


17 


30 


212953 x at 


i 


CALR 


19 


109 


204323 x at 


i 


NF1 


17 


31 


214316 x at 


i 


CALR 


19 


110 


211095 at 


i 


NF1 


17 


32 


214322 at 


i 


CAMK2G 


10 


111 


207545 s at 


i 


NUMB 


14 


33 


212669 at 


i 


CAMK2G 


10 


112 


209073 s at 


i 


NUMB 


14 


34 


212757 s at 


i 


CAMK2G 


10 


113 


209615 s at 


i 


PAK1 


11 


35 


202763 at 


i 


CASP3 


4 


114 


221280 s at 


i 


PARD3 


10 


36 


212063 at 


i 


CD44 


11 


115 


221527_s_at 


i 


PARD3 


10 


37 


204489 s at 


i 


CD44 


11 


116 


221526 x at 


i 


PARD3 


10 


38 


210916 s at 


i 


CD44 


11 


117 


210094 s at 


i 


PARD3 


10 
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Table 5.10 (con't.) 



No. 


AffylD 


Sign 


Gene 


Chr. 


No. 


AffylD 


Sign 


Gene 


Chr. 


39 


204490 s at 


i 


CD44 


11 


118 


202927_at 


i 


PIN1 


19 


40 


212014 x at 


i 


CD44 


11 


119 


204186 s at 


i 


PPID 


4 


41 


217523 at 


i 


CD44 


11 


120 


204185 x at 


i 


PPID 


4 


42 


209835 x at 


i 


CD44 


11 


121 


202429 s at 


i 


PPP3CA 


4 


43 


216056 at 


i 


CD44 


11 


122 


202425 x at 


i 


PPP3CA 


4 


44 


204247 s at 


i 


CDK5 


7 


123 


202457 s at 


i 


PPP3CA 


4 


45 


210123 s at 


i 


CHRNA7 


15 


124 


215195 at 


i 


PRKCA 


17 


46 


208792 s at 


i 


CLU 


8 


125 


215194 at 


i 


PRKCA 


17 


47 


208791 at 


i 


CLU 


8 


126 


213093 at 


i 


PRKCA 


17 


48 


222043 at 


i 


CLU 


8 


127 


206923_at 


i 


PRKCA 


17 


49 


200766 at 


i 


CTSD 


11 


128 


20426 l_s_at 


i 


PSEN2 


1 


50 


217341 at 


i 


DNM1 


9 


129 


204262 s at 


i 


PSEN2 


1 


51 


215116 s at 


i 


DNM1 


9 


130 


211373_s_at 


i 


PSEN2 


1 


52 


211484 s at 


i 


DSCAM 


21 


131 


208641 s at 


i 


RAC1 


7 


53 


202711 at 


i 


EFNB1 


X 


132 


208370_s_at 


i 


RCAN1 


21 


54 


203499 at 


i 


EPHA2 


1 


133 


215253 s at 


i 


RCAN1 


21 


55 


210651 s at 


i 


EPHB2 


1 


134 


202376 at 


i 


SERPINA3 


14 


56 


211165 x at 


i 


EPHB2 


1 


135 


201469 s at 


i 


SHC1 


1 


57 


209589 s at 


i 


EPHB2 


1 


136 


214853 s at 


i 


SHC1 


1 


58 


209588 at 


i 


EPHB2 


1 


137 


206330 s at 


i 


SHC3 


9 


59 


217887 s at 


i 


EPS 15 


1 


138 


214544 s at 


i 


SNAP23 


15 


60 


217886 at 


i 


EPS 15 


1 


139 


209130 at 


i 


SNAP23 


15 


61 


215961 at 


i 


F12 


5 


140 


209131_s_at 


i 


SNAP23 


15 


62 


205774 at 


i 


F12 


5 


141 


202508_s_at 


i 


SNAP25 


20 


63 


215075_s_at 


i 


GRB2 


17 


142 


202507 s at 


i 


SNAP25 


20 


64 


209945 s at 


i 


GSK3B 


3 


143 


212777 at 


i 


SOS1 


2 


65 


200696 s at 


i 


GSN 


9 


144 


212780 at 


i 


SOS1 


2 


66 


214040 s at 


i 


GSN 


9 


145 


221281 at 


i 


SRC 


20 


67 


218121 at 


i 


HMOX2 


16 


146 


213324 at 


i 


SRC 


20 


68 


218120 s at 


i 


HMOX2 


16 


147 


221284 s at 


i 


SRC 


20 


69 


217080 s at 


i 


HOMER2 


15 


148 


203084 at 


i 


TGFB1 


19 


70 


212983_at 


i 


HRAS 


11 


149 


203085_s_at 


i 


TGFB1 


19 


71 


202282 at 


i 


HSD17B10 


X 


150 


220407_s_at 


i 


TGFB2 


1 


72 


211152 s at 


i 


HTRA2 


2 


151 


209908 s at 


i 


TGFB2 


1 


73 


203089 s at 


i 


HTRA2 


2 


152 


220406 at 


i 


TGFB2 


1 


74 


217731 s at 


i 


ITM2B 


13 


153 


209909 s at 


i 


TGFB2 


1 


75 


217732 s at 


i 


ITM2B 


13 


154 


213135 at 


i 


TIAM1 


21 


76 


207322 at 


i 


ITSN1 


21 


155 


206409 at 


i 


TIAM1 


21 


77 


209298 s at 


i 


ITSN1 


21 


156 


205810 s at 


i 


WASL 


7 


78 


209297 at 


i 


ITSN1 


21 


157 


205809 s at 


i 


WASL 


7 


79 


35776 at 


i 


ITSN1 


21 
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Table 5.11: Drugs predicted to enhance expression of Hsa21 gene set-2 



Drug (concentration) 


Target or disease 


Estradiol (lOnM) 


Treatment of urogenital symptoms associated with post- 
menopausal atrophy of the vagina and/or the lower urinary 
tract. 


4,5-dianilinophthalimide (lOmM) 


Potent inhibitors of amyloid P fibrillization. 


Colforsin (50mM) 


Novel drug for the treatment of acute heart failure. 



Table 5.12 : Drugs predicted to repress expression of Hsa21 gene set-2 



Drug (concentration) 


Target or disease 


TrichostatinA(l.mM, lOO.nM) 


An organic compound that serves as an antifungal 
antibiotic and selectively inhibits the class I and II 
mammalian histone deacetylase families of enzymes, but 
not class III HDACs. 


Geldanamycin (l.mM) 


A benzoquinone ansamycin antibiotic that binds to Hsp90 
and inhibits its function. HSP90 proteins play important 
roles in the regulation of the cell cycle, cell growth, cell 
survival, apoptosis, angiogenesis and oncogenesis. 


Podophyllotoxin (9.6mM) 


Used in the treatment of many cancers, particularly 
small cell lung carcinoma and testicular cancer and 
brain tumors. It arrests cell growth by inhibiting DNA 
topo-isomerase II, which causes double strand breaks 
in DNA. 


Cytochalasin B (20.8mM) 


A cell-permeable mycotoxin. It inhibits cytoplasmic, 
division by blocking the formation of contractile 
microfilaments. It inhibits cell movement and induces 
nuclear extrusion. 
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Table 5.12 (con't.) 



Drug (concentration) 


Target or disease 


Propafenone (10. 6mM) 


Used to prevent or treat serious irregular heartbeats 
(arrhythmia). Belongs to a class of drugs called 
antiarrhythmics. 


Chlorpromazine (1 1.2mM) 


It is used in the treatment of various psychiatric 
illnesses and is also used in the management of nausea 
and vomiting associated with terminal illness. 


Tanespimycin (l.mM) 


It helps cause the breakdown of certain proteins in the cell, 
and may kill cancer cells. It is a type of antineoplastic 
antibiotic and a type of HSP90 inhibitor. 



5.3 Drug target prediction 

In this application, we search for the set of genes affected by a given set of drug 
treatments as follows: 

1) Relate the given set of drug treatments with drug treatments in the CMAP 

a. Use Drug Bank chemical similarity. 

b. Based on overlap of drug targets. 

2) Extract expression profiles of the CMAP drug treatments related to the given 
drug treatments. 

3) Generate the gene expression signature as in Section 5.2.1. 

4) Apply RPPR [169] to the gene expression signature to expand the selected 
expression profiles. 



216 



5) Apply our clustering methods to search for the group of genes most 
differentially expressed. This set of genes is the answer to the problem. 

5.4 Application of HIGEDA into prediction regulatory motifs 

HIGEDA is capable of discovering new motifs in a sequence set. In this 
application, we use HIGEDA to discover transcription factor binding sites (TFBS) and 
miRNA complementary binding sites (MBS) in the genomic and mRNA sequences of 
genes within each cluster from Section 5.1. Figure 5-6 illustrates the proposed method. 
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Figure 5-6 : Transcription factor and mircoRNA binding site prediction using HIGEDA 
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6. Conclusions and future work 

Our studies regarding microarray analysis included meta-analysis methods, cluster 
analysis of gene expression data, and motif finding in biological sequences. Use of 
meta-analysis methods should provide better results in analysis of microarray data 
obtained from multiple studies because of a stronger statistical basis, and a reduction in 
bias caused by results of any individual study. 

In cluster analysis, we showed that soft cluster methods are more appropriate for 
gene expression data because it allows genes to belong to multiple clusters, reflecting 
their participation in more than one biological process. Use of the FCM algorithm is 
optimal because it eliminates problems seen in the EM algorithm of slow convergence 
at regions of overlap between adjacent clusters and its assumption about the data 
distribution model may violate that of most microarray data that do not follow a specific 
distribution model, such as Gaussian or Chi-Squared. We also proved that, by using the 
possibility to probability model conversion, we can apply probability based model 
selection to possibility based models. We therefore benefit from the power of statistical 
methods but also effectively model the natural data distributions because of the 
flexibility of possibility based models. The advantages of our approach have been 
shown in our novel methods of fuzzy cluster validation using either gene expression 
data or GO annotations, missing data imputation, fuzzy cluster defuzzification and 
cluster number detection. 



218 



In the motif-finding problem, we showed that a PWM based model is most 
appropriate. With the use of the hierarchical genetics algorithm where patterns of sub- 
sequences in motifs can change simultaneously, the motif-finding algorithm can avoid 
local optima. In addition, with the use of the dynamic programming algorithm, gapped 
motifs can be discovered. 

In this dissertation, we have made the following contributions to bioinformatics 
research: 

1) Development of a meta-analysis method to effectively filter differentially 
expressed genes using external metrics, information based on the specific biological 
context, and prior knowledge of existing biological pathways, PPIs and Gene Ontology. 
The method is applicable to cases where data come from different laboratories using 
different versions of different microarray platforms, and should provide more 
biologically relevant results. We further propose a novel use of the Connectivity Map 
datasets to predict gene-drug associations. 

2) Development of a novel clustering algorithm that combines algorithms 
developed for FCM parameter initialization, FCM missing data imputation, 
defuzzification of FCM fuzzy partition and fuzzy cluster evaluation with the 
optimization algorithm, GA. The algorithm is applicable to cluster analysis of gene 
expression data without a prior specification of the number of clusters and assumption 
of the data distributions. Our clustering method successfully supported specific 
applications described in Section 5. 
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3) Development of a novel motif-finding algorithm for the problem of gene 
regulatory sequence prediction for groups of genes with similar expression patterns 
under some biological conditions of interest, demonstrated in Section 4.3.2. 

4) Development of a machine learning based framework for a complete solution to 
the problem of microarray data analysis. 

5) We have developed a website [168] where the methods of this dissertation are 
made available online. The website uses R software as the computational back-end so 
that our methods can work seamlessly with other statistical, machine learning and data 
mining methods available in R packages. We will update our website with our machine 
learning based framework to allow microarray data analysis online. 

In future work, we will improve our algorithm methods with parallelization and 
implement them as web applications to support gene-drug-disease association 
prediction. These applications will be run on High Performance Computing (HPC) and 
accessible online. In addition, we will adapt our methods to other research domains 
such as Magnetic Resonance Imaging (MRI) data analysis, Social network information 
retrieval, cluster analysis of internet websites and users data. For the MRI data analysis, 
our cluster methods can help with image segmentation, marker identification and cluster 
analysis. For social network information retrieval, our cluster methods can help with 
identification the group of users having similar social behaviors, namely entertainment, 
buying and travelling. 
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